Abstract 

Large distributed systems such as Computational and Data Grids require that a substantial amount of 
monitoring data be collected for a variety of tasks such as fault detection, performance analysis, 
performance tuning, performance prediction, and scheduling. Some tools are currently available and others 
are being developed for collecting and forwarding this data. The goal of this paper is to describe a 
common architecture with all the major components and their essential interactions in just enough 
detail that Grid Monitoring systems that follow the architecture described can easily devise common 
APIs and wire protocols. To aid implementation, we also discuss the performance characteristics of a 
Grid Monitoring system and identify areas that are critical to proper functioning of the system. 

1.0 Introduction 

The ability to monitor and manage distributed computing components is critical for enabling 
high-performance distributed computing. Monitoring data is needed to determine the source of performance 
problems and to tune the system and application for better performance. Fault detection and recovery 
mechanisms need monitoring data to determine if a server is down, and whether to restart the server or redirect 
service requests elsewhere [14][10]. A performance prediction service might use monitoring data as inputs for 
a prediction model [16], which would in turn be used by a scheduler to determine which resources to use. 

There are several groups that are developing Grid monitoring systems to address this problem 
[11][16][9][14] and these groups have recently seen a need to interoperate. In order to facilitate this, we have 
developed an architecture of monitoring components. A Grid monitoring system is differentiated from a 
general monitoring system in that it must be scalable across wide-area networks, and include a wide range of 
heterogeneous resources. It must also be integrated with other Grid middleware in terms of naming and 
security issues. We believe the Grid Monitoring Architecture (GMA) described here addresses these concerns 
and is sufficiently general that it could be adapted for use in distributed environments other than the Grid. For 
example, it could be used with large compute farms or clusters that require constant monitoring to ensure all 
nodes are running correctly. 

2.0 Design Considerations 

With the potential for thousands of resources at geographically different sites and tens-of-thousands of 
simultaneous Grid users, it is important for the data management and collection facilities to scale while, at the 
same time, protecting the data from spoiling. 

In order to allow scalability in both the administration and performance impact of such a system, the 
decision-making as to what is monitored, measurement frequency, and how the data is made available to the 
public must be widely distributed and dynamic. Thus, instead of a centralized management component, 
multiple independent management components synchronize their state through a directory service, which may 
itself be distributed. Distributing management in this fashion also helps minimize the effects of host and 
network failure, making the system more robust under precisely the kinds of conditions it is trying to detect. 

In some models, such as the CORBA Event Service, all communication flows through a central component, 
which represents a potential bottleneck. In contrast, we propose that performance event data, which makes up 
the majority of the communication traffic, should travel directly from the producers of the data to the 
consumers of the data. In this way, individual producer/consumer pairs can do "impedance matching" based 
on negotiated requirements, and the amount of data flowing through the system can be controlled in a precise 
and localized fashion based on current load considerations. The design also allows for replication and 
reduction of event data at intermediate components acting as consumer/producer caches or filters. Use of these 
intermediate components lessens the load on producers of event data that is of interest to many consumers, 
with subsequent reductions in the network traffic, as the intermediaries can be placed "near" the data 
consumers. The directory service contains only metadata about the performance events and system 
components and is accessed relatively infrequently, reducing the chance that it would be a bottleneck. 

We also considered a purely SNMP-based solution for monitoring, but rejected it because we felt that 
SNMP's simple GET/SET model is not rich enough, as there is no support for subscription. Also, it is not 
clear that its security model maps well to the Grid Security Infrastructure. However, we definitely envision the 
use of SNMP-based tools as a source of monitoring data. 

3.0 Architecture 

The GMA architecture supports both a producer/consumer model, similar to several existing Event Service 
systems such as the CORBA Event Service [1], and a query/response model. For either model, producers or 
consumers that accept connections publish their existence in a directory service. Consumers use the directory 
service to locate one or more producers generating the type of event data they are interested in. Each consumer 
then subscribes to or queries the matching producer(s) directly. Likewise, a producer may query the directory 
service to locate consumer(s) that accept and process event data in a given manner - for example, a consumer 
that archives event data for later analysis. Once the appropriate consumer is identified, the producer would 
connect to it directly and stream the event data - similar in behavior to when a consumer subscribes to a 
producer, but initiated by the producer. 
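
The register, lookup, and subscribe flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: everything runs in one process, and the class names (Directory, Producer), the event name "cpu_load", and the callback signature are hypothetical rather than part of the GMA. The point it shows is that the directory holds only metadata, while event data moves directly from producer to consumer.

```python
from collections import defaultdict


class Directory:
    """Holds only metadata: which producers offer which event type."""

    def __init__(self):
        self._producers = defaultdict(list)  # event name -> producers

    def register(self, event_name, producer):
        self._producers[event_name].append(producer)

    def lookup(self, event_name):
        return list(self._producers.get(event_name, []))


class Producer:
    def __init__(self, name):
        self.name = name
        self._subscribers = defaultdict(list)  # event name -> callbacks

    def subscribe(self, event_name, callback):
        self._subscribers[event_name].append(callback)

    def publish(self, event_name, value):
        # Event data goes straight to subscribers; the directory never sees it.
        for callback in self._subscribers[event_name]:
            callback(self.name, event_name, value)


directory = Directory()
producer = Producer("hostA-producer")
directory.register("cpu_load", producer)          # producer publishes its existence

received = []
for found in directory.lookup("cpu_load"):        # consumer locates producers
    found.subscribe("cpu_load",                   # and subscribes to each directly
                    lambda src, ev, val: received.append((src, ev, val)))

producer.publish("cpu_load", 0.75)                # data flows producer -> consumer
```

A producer-initiated subscription would follow the same pattern with the roles reversed: the consumer registers in the directory and the producer looks it up.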

3.1 Terminology 

The monitoring data that the GMA is designed to handle are timestamped events. An event is a named 
collection of data. The data may relate to anything, but common events will be memory usage, network usage, 
or "error" conditions such as a server process crashing. The producer is the component that makes the event 
data available. A consumer is any process that requests or accepts event data. A directory service is used to 
publish what event data is available and which producer to contact to get it. All of these components are 
described in detail below. 
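
As a concrete reading of "a named collection of data" with a timestamp, an event might be represented as follows. The field names and the example events are assumptions chosen for illustration; the architecture itself does not prescribe a representation.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    name: str                 # e.g. "memory_usage"
    data: dict                # the named collection of values
    timestamp: float = field(default_factory=time.time)  # seconds since the epoch


mem = Event("memory_usage", {"host": "hostA", "used_mb": 512})
crash = Event("server_crash", {"host": "hostA", "process": "httpd"}, timestamp=1000.0)
```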

3.2 Components 

The architecture consists of the following components, shown in Figure 1: 

• consumers 

• producers 

• directory service 

By defining three interfaces (the consumer to producer interface, the consumer to directory 
service interface, and the producer to directory service interface), we can build "standard" grid 
monitoring services that will all inter-operate. 

Directory Service 

To locate, name, and describe the structural characteristics of any data available to the Grid, a 
distributed directory service for publishing this information must be available. The primary 
purpose of this directory service is to allow information consumers (users, visualization tools, 
programs and resource schedulers) to discover and understand the characteristics of the information 
that is available. In addition, information producers must be able to update the information to 
reflect the system state. In the context of common operations for both consumers and producers, 
they will be collectively referred to as clients. 

Figure 1: Grid Monitoring Architecture Components 

The directory service contains a listing of all available event data and their associated producers. This 
allows consumers to discover what event data are currently available (through the producer registration), what 
the characteristics of the data are, and which producer(s) to contact to receive a given type of event data. 

The directory service, however, is not responsible for the storage of performance data itself, only its name 
and other characteristics. We assume the names and characteristics associated with dynamic performance data 
change slowly (unlike the performance data itself). That is, the name and structural characteristics of a data set 
remain relatively constant while the valid contents of the data set may change dramatically over time. 

The functions supported by a directory service are: 

1. Authorize-consumer - Establish identity of a consumer, which is in turn mapped to access permissions 
for the next, or possibly several subsequent, transaction(s). 

2. Authorize-producer - Same as Authorize-consumer, although different mechanisms may be used for 
the two authorization operations. 

3. Search - Perform a search for event data. The client should indicate whether only one result, or more 
than one result, if available, should be returned. An optional extension would allow the client to get 
multiple results one element at a time using a "get next" query in subsequent searches. 

• Preconditions: The client is authorized to perform the search. 

• Postconditions: The result(s) of the search are returned, which includes a well-defined null value 
for searches which did not match in the directory. 

4. Add - Add a record to the directory. 

• Preconditions: The client is authorized to add the record. The record conforms to the directory's 
schema. The record is not a duplicate. 

• Postconditions: The record is in the directory. 

5. Remove - Remove a record from the directory. 

• Preconditions: The client is authorized to remove the record. The record matches exactly one 
record in the directory. 

• Postconditions: The record is not in the directory. 

6. Update - Change the state of a record in the directory. 

• Preconditions: The client is authorized to modify the record. The record matches exactly one 
record in the directory. 

• Postconditions: The record now has the new values. 

7. Version request - A client may request the current version of the interface. The version numbering 
system is TBD. 
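
To make the preconditions and postconditions of Search, Add, Remove, and Update concrete, the following is a minimal in-memory sketch. It is not a proposed API: the class and method names, the two-field schema, and the use of a set of client names for authorization are all assumptions made for illustration, and a real directory would be distributed and backed by a service such as LDAP.

```python
class DirectoryError(Exception):
    """Raised when a precondition of a directory operation is not met."""


class InMemoryDirectory:
    REQUIRED_FIELDS = {"event", "producer"}  # hypothetical schema

    def __init__(self, authorized_clients):
        self._authorized = set(authorized_clients)
        self._records = []

    def _check_auth(self, client):
        if client not in self._authorized:
            raise DirectoryError(f"{client} is not authorized")

    def _matches(self, pattern):
        return [r for r in self._records
                if all(r.get(k) == v for k, v in pattern.items())]

    def _exactly_one(self, pattern):
        hits = self._matches(pattern)
        if len(hits) != 1:
            raise DirectoryError("pattern must match exactly one record")
        return hits[0]

    def search(self, client, pattern, first_only=False):
        self._check_auth(client)
        hits = [dict(r) for r in self._matches(pattern)]  # return copies
        if not hits:
            return None  # well-defined null value for "no match"
        return hits[0] if first_only else hits

    def add(self, client, record):
        self._check_auth(client)
        if not self.REQUIRED_FIELDS.issubset(record):
            raise DirectoryError("record does not conform to the schema")
        if record in self._records:
            raise DirectoryError("duplicate record")
        self._records.append(dict(record))

    def remove(self, client, pattern):
        self._check_auth(client)
        self._records.remove(self._exactly_one(pattern))

    def update(self, client, pattern, new_values):
        self._check_auth(client)
        self._exactly_one(pattern).update(new_values)


d = InMemoryDirectory(authorized_clients={"producer-1", "consumer-1"})
d.add("producer-1", {"event": "cpu_load", "producer": "hostA:9000"})
found = d.search("consumer-1", {"event": "cpu_load"}, first_only=True)
missing = d.search("consumer-1", {"event": "disk_io"})
d.update("producer-1", {"event": "cpu_load"}, {"producer": "hostB:9000"})
```

Each method checks its preconditions first and raises an error if one fails, so a successful return implies the stated postcondition holds.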

Query-optimized directory services such as LDAP [15], Globus MDS [3], the Legion Information Base, 
and the Novell NDS, all provide the necessary base functionality for this service, but only in their fully 
distributed implementations. Some public-domain implementations of these services do not support 
distributed implementation. 

Consumer 

A consumer is any program that receives event data from a producer. Consumers that will accept 
asynchronous requests from producers will publish this information in the directory service. The 
functions supported by a consumer are: 

1. Authorize to producer - The consumer contacts a producer and proves its identity. This may need to be 
performed once per "session", or on every request. 

2. Authorize from producer - The consumer accepts authorization requests from a producer and verifies 
its identity. As in Authorize to producer, this may be done once per session or on every 
producer-initiated request. 

3. Query - The consumer receives one event or set of events from the producer. Optional extensions are 
a filter to indicate interest in only a subset of events or to perform transformations on event data. 

• Preconditions: The consumer is authorized to receive these event(s). The event data is available. 

• Postconditions: One or more events are returned together in the reply. 

4. Consumer-initiated Subscribe - The consumer establishes a connection to the producer to receive 
events in a stream. 

• Preconditions: The consumer is authorized to connect to the producer and receive these event(s). 
The event data is available. 

• Postconditions: Same as for Query, except that in addition to returning the most recent event, on 
success the producer will either (a) return events in a stream over the connection used for the request 
or (b) inform the consumer of the location of a new connection from which it can read the stream of 
events. 

• Other behaviors: If the consumer closes the established connection, the producer should simply 
consider the subscription ended (generating no errors). If the underlying source of event data stops 
producing data, the producer may close the connection without warning, so consumers should be 
designed to recover gracefully in this instance. 

5. Consumer-initiated Unsubscribe - The consumer tells a producer to close the subscription. An 
optional extension is a "close all" version which closes all subscriptions for this consumer. 

• Preconditions: The subscription exists for the producer/consumer pair. The consumer is authorized 
to end it. 

• Postconditions: The subscription is removed. No more data should be sent for this subscription 
after the producer has confirmed. 

6. Producer-initiated Subscribe - The consumer accepts subscriptions from producers who wish to send 
events. 

• Preconditions: The producer is authorized to send events to this consumer. 

• Postconditions: A new subscription is created for this producer/consumer pair. 

7. Producer-initiated Unsubscribe - The consumer accepts an unsubscribe request from the producer. 

• Preconditions: The subscription exists. The producer is authorized to end it. 

• Postconditions: The subscription is removed. 

8. Authorize to directory - The consumer contacts the directory service and proves its identity. This may 
need to be performed once per "session" or on every lookup. 

9. Lookup - The consumer makes a query to the directory service, of which at least 2 types should be 
available: (1) producer: get data for a producer associated with an event; (2) event: get the description 
of the event. 

• Preconditions: Authorization has been performed. 

• Postconditions: The directory service is unchanged (read-only operation). 

10. Update - The consumer updates records in the directory service regarding events for which this 
consumer will accept producer-initiated subscriptions. 

• Preconditions: Authorization has been performed. 

• Postconditions: The directory service has more/less/modified records reflecting the new 
information. 

There are many possible types of consumers. These may include: 

• real-time monitor: This consumer is used to collect monitoring data in real time for use by real-time 
analysis tools. It checks the directory service to see what data is available, and then "subscribes" to all 
the events it is interested in. The producers then send the event data to the consumer as it is generated. 
Data from many sources can then be used for real-time performance analysis. 

• archiver: This consumer may be used to collect data for the archive service. It subscribes to the 
producers, collects the event data, and places it in the archive. We note that a monitoring architecture needs 
this component, as it is important to archive event data in order to provide the ability to do historical 
analysis of system performance, and determine when/where changes occurred. While it may not be 
desirable to archive all monitoring data, it is desirable to archive a good sampling of both "normal" and 
"abnormal" system operation, so that when problems arise it is possible to compare the current system 
to a previously working system. In this architecture, the archive is just another consumer. 

• process monitor: This consumer can be used to trigger an action based on an event from a server 
process. For example, it might run a script to restart the processes, send email to a system administrator, 
call a pager, etc. 

• overview monitor: This consumer collects events from several sources, and uses the combined 
information to make some decision that could not be made on the basis of data from only one host. For 
example, one may want to trigger a page to a system administrator at 2 AM, only if both the primary 
and backup servers are down. 
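
The overview monitor's "page only if both servers are down" rule can be sketched as a consumer that keeps the last known status of each watched host. The class name, the "up"/"down" status strings, and the alert callback are hypothetical; the sketch also assumes that a page should be sent once per outage rather than on every repeated event.

```python
class OverviewMonitor:
    """Consumer that combines status events from several hosts."""

    def __init__(self, watched_hosts, alert):
        self._status = {host: "up" for host in watched_hosts}
        self._alert = alert          # callback, e.g. a function that pages an admin
        self._alerted = False

    def on_event(self, host, status):
        if host not in self._status:
            return                   # ignore hosts we do not watch
        self._status[host] = status
        all_down = all(s == "down" for s in self._status.values())
        if all_down and not self._alerted:
            self._alert(sorted(self._status))
            self._alerted = True
        elif not all_down:
            self._alerted = False    # re-arm once any host recovers


pages = []
monitor = OverviewMonitor(["primary", "backup"], alert=pages.append)
monitor.on_event("primary", "down")   # backup still up: no page
monitor.on_event("backup", "down")    # both down: page once
monitor.on_event("backup", "down")    # repeated event: no second page
```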

Producer 

Producers are responsible for providing event data to consumers, either by request or asynchronously. 
Producers will publish event availability information in the directory service. The functions supported by a 
producer are: 

1. Authorize from consumer - The producer establishes a consumer's identity and access permissions. 
Authorization may be combined with subscription or query requests, or performed separately with the 
results stored in a shared "key" of some sort. 

2. Authorize to consumer - The producer contacts a consumer and proves its identity. As for Authorize to 
producer, this may need to be performed once per session or on every new request. 

3. Query - The producer returns a single set of event(s) in response to a consumer query. 

• Preconditions: The consumer is authorized to receive data about the event. 

• Postconditions: The event data, if present, is returned. 

4. Consumer-initiated Subscribe - Accept consumer requests to establish a stream of event data 
(subscription). This request should include parameters, filters, etc. 

• Preconditions: Consumer is authorized to subscribe to requested event data. 

• Postconditions: The subscription is added for a consumer, and the producer either (a) returns 
events in a stream over the connection used for the request or (b) informs the consumer of the 
location of a new connection from which it can read the stream of events. 

5. Consumer-initiated Unsubscribe - This is the normal operation by which a consumer ends its 
subscription. An optional "unsubscribe all" extension would allow the consumer to cancel all its subscriptions 
at once. As mentioned in the consumer section, if a consumer summarily closes its connection, the 
producer should automatically unsubscribe it everywhere. 

• Preconditions: The subscription exists for this producer/consumer pair. 

• Postconditions: The consumer/producer pair has one less subscription. 

6. Producer-initiated Subscribe - A producer asynchronously begins a subscription with a consumer. 

• Preconditions: The producer is authorized to send data to the consumer. 

• Postconditions: The subscription is added and the producer may now send data. 

7. Producer-initiated Unsubscribe - The producer informs a consumer that the subscription is ending. 

• Preconditions: The subscription exists for this consumer/producer pair. The consumer supports this 
function, allowing producers to asynchronously unsubscribe. 

• Postconditions: The subscription is removed. Note that even in the case of failure the subscription 
may be removed by the producer. 

8. Version - A consumer may request the current version of the interface. The version numbering system 
is TBD. 

Producers can service "streaming" or "query" requests from consumers. In streaming mode the consumer 
makes a single request, then receives events in a stream until an explicit action is taken to end the connection. 
In query mode the consumer makes a single request and receives a single event in reply. 
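
The difference between the two modes can be illustrated with a small in-process sketch in which a list of readings stands in for a live sensor. The class and method names are hypothetical, and a Python generator is used here only as a stand-in for a network stream.

```python
import itertools


class SensorProducer:
    def __init__(self, readings):
        self._readings = iter(readings)   # stand-in for a live sensor

    def query(self):
        """Query mode: one request, one event in reply (None if no data)."""
        return next(self._readings, None)

    def stream(self):
        """Streaming mode: one request, then events until the consumer stops."""
        yield from self._readings


producer = SensorProducer([0.2, 0.4, 0.9, 0.5, 0.1])
single = producer.query()                                   # one event
streamed = list(itertools.islice(producer.stream(), 3))     # consumer ends the stream after 3 events
```

Here the consumer ends the stream by simply ceasing to read, which corresponds to closing the connection in a networked implementation.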

The producers are also used to provide access control to the event data, allowing different access to 
different classes of users. Since Grids typically have multiple organizations controlling the resources 
being monitored, there may be different access policies (firewalls possibly), support for different 
frequencies of measurement, and willingness to allow access to different performance details for consumers 
"inside" or "outside" of the organization running the resource. Some sites may allow internal access to 
real-time event streams, while providing only summary data off-site. The producers would enforce these 
types of policy decisions. This mechanism is especially important for monitoring clusters or computer 
farms, where there may be a large amount of internal monitoring, but only a limited amount of 
monitoring data accessible to the Grid. 

There may also be components that are both consumers and producers. For example, a consumer might collect 
event data from several producers, and then use that data to generate a new derived event data type, which 
is then made available to other consumers, as shown in Figure 2. 

Figure 2: Joint Consumer/Producer 
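
A joint consumer/producer of this kind might look like the following sketch, which consumes per-host load events and produces a derived average. The class name, the derived event name "cluster_avg_load", and the callback signatures are assumptions for illustration only.

```python
class AverageLoadRelay:
    """Consumes per-host load events and produces a derived average event."""

    def __init__(self):
        self._latest = {}        # host -> most recent load value
        self._subscribers = []   # downstream consumers (callbacks)

    # Producer side: downstream consumers subscribe here.
    def subscribe(self, callback):
        self._subscribers.append(callback)

    # Consumer side: upstream producers deliver events here.
    def on_event(self, host, load):
        self._latest[host] = load
        average = sum(self._latest.values()) / len(self._latest)
        for callback in self._subscribers:
            callback("cluster_avg_load", average)


derived = []
relay = AverageLoadRelay()
relay.subscribe(lambda name, value: derived.append((name, value)))
relay.on_event("hostA", 0.5)
relay.on_event("hostB", 1.5)
```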

Sources of Event Data 

There are many possible sources of event data, including monitoring sensors. The following is a summary 
of common types of sensors: 

• host sensors: These sensors perform host monitoring tasks, such as monitoring CPU load, available 
memory, or TCP retransmissions. Host sensors may be layered on top of SNMP-based tools, and 
therefore run remotely from the host being monitored. Host sensors could also be used to monitor host 
configuration information, such as what versions of the operating system or other software packages are 
installed. 

• network sensors: These sensors perform SNMP queries to a network device, typically a router or 
switch. Information on which device statistics are being monitored is published in the directory service. 

• process sensors: Process sensors generate events when there is a change in process status (for example, 
when it starts, dies normally, or dies abnormally). They might also generate an event if some dynamic 
threshold is reached (for example, if the average number of users over a certain time period exceeds a 
given threshold). 

• application sensors: Autonomous sensors can also be embedded inside of applications. These sensors 
might generate events if a static threshold is reached (for example, if the number of locks taken exceeds 
a threshold), upon user connect/disconnect or change of password, upon receipt of a UNIX signal, or 
upon any other user-defined event. Application sensors can also be used to collect detailed monitoring 
data about the application to be used for performance analysis. These types of sensors may not register 
themselves with the directory service, but could still feed their results to the system. A special case of 
application sensors would be library sensors that would be embedded in library code and compiled into 
the application. 

• storage or I/O sensors: These sensors perform any monitoring of storage systems such as disks and 
tapes, obtaining information on block size, access time, seek time, etc. 

• middleware sensors: These sensors would gather information about middleware services such as 
directory and authentication servers. They could report request volume, average service time, number 
of requests returned due to timeouts, etc. and would be used to discover and repair performance 
problems in this service layer. 

Figure 3: Relationship of Producers and Sensors 

A producer may be associated with a single sensor, all sensors on a given host, all sensors on a given subnet, 
or any arbitrary group of sensors. This is not defined by the architecture, but is left as an implementation 
decision. Figure 3 shows one example of how this might be implemented. We note that there are scalability 
and reliability issues with how this is implemented, as described below. 

Optional Producer Tasks 

There are many other services that producers might provide, such as event filtering and caching. For 
example, producers could optionally perform any intermediate processing of the data the consumer might 
require. A consumer might request that a prediction model be applied to a measurement history from a 
particular sensor, and then be notified only if the predicted performance falls below a specified threshold. The 
producer might in this case filter the data for the consumer and deliver it according to the schedule the consumer 
determines. Another example is that a consumer might request that an event be sent only if its value crosses a 
certain threshold. Examples of such a threshold would be if CPU load becomes greater than 50%, or if load 
changes by more than 20%. The producer might also be configured to compute summary data. For example, it 
can compute 1, 10, and 60 minute averages of CPU usage, and make this information available to consumers. 
Information on which services the producer provides would be published in the directory server, along with 
the event information. 
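
Two of these optional tasks, threshold filtering and summary averages, are sketched below. The names are hypothetical, and the sketch makes two assumptions the text leaves open: "crosses" is taken to mean a crossing in either direction, and the average is computed over a fixed number of samples rather than a time interval.

```python
from collections import deque


def crosses_threshold(previous, current, threshold):
    """True when the value moves from one side of the threshold to the other."""
    if previous is None:
        return current > threshold
    return (previous > threshold) != (current > threshold)


class ThresholdFilter:
    """Forwards a measurement to the consumer only when it crosses the threshold."""

    def __init__(self, threshold, deliver):
        self._threshold = threshold
        self._deliver = deliver      # callback that sends the event to the consumer
        self._previous = None

    def on_measurement(self, value):
        if crosses_threshold(self._previous, value, self._threshold):
            self._deliver(value)
        self._previous = value


class MovingAverage:
    """Average over the most recent `window` samples."""

    def __init__(self, window):
        self._values = deque(maxlen=window)

    def add(self, value):
        self._values.append(value)
        return sum(self._values) / len(self._values)


sent = []
cpu_filter = ThresholdFilter(threshold=50.0, deliver=sent.append)
for load in [10.0, 40.0, 60.0, 70.0, 30.0]:
    cpu_filter.on_measurement(load)   # only 60.0 (rising) and 30.0 (falling) are delivered

avg = MovingAverage(window=3)
averages = [avg.add(v) for v in [10.0, 20.0, 30.0, 40.0]]
```

Filtering at the producer in this way means that only two of the five measurements cross the network, which is the load reduction the text describes.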

Protocols 

The next step is to define the protocols for consumer-to-producer communication, and for consumer 
and producer communication with the directory service. For example, current proposals include using LDAP for 
communicating with the directory service, and SOAP for subscribe requests. These issues will be addressed in 
future Global Grid Forum Performance Working Group documents. 

4.0 Sample Use 

An example use of the GMA is shown in Figure 4. Event data is collected on each host and at all network 
routers between them, and aggregated at a producer, which registers the availability of the data in the directory 
service. A real-time monitoring consumer subscribes to all this event data for real-time visualization and 
performance analysis. The producer is capable of computing summaries of network throughput and latency 
data, enabling a "network-aware" client [11] to optimally set its TCP buffer size. A subset of the producer's 
data, that from the "server" and "router" nodes, is also sent to an archive. 




Figure 4: Sample Use of Monitoring System 



5.0 Implementation Issues 

The purpose of a monitoring system is to reliably deliver timely and accurate information without 
perturbing the system. Therefore the architecture must consider performance issues explicitly and make 
recommendations and requirements of implementations with the goal of avoiding services which look good 
on paper but which fail in practice. We discuss several of these implementation design issues below. 

Monitoring Service Characteristics 

The following characteristics distinguish performance monitoring information from other system data, such 
as files and databases. 

• Performance information has a fixed, often short lifetime of utility. Most monitoring data may go stale 
quickly, making rapid read access important, but obviating the need for long-term storage. The notable 
exception to this is data that gets archived for accounting or post-mortem analysis. 

• Updates are frequent. Unlike the more static forms of "metadata," dynamic performance information 
is typically updated more frequently than it is read. Most extant information-base technologies are 
optimized for query and not update, making them potentially unsuitable for dynamic information storage. 

• Performance information is often stochastic. It is frequently impossible to characterize the 
performance of a resource or an application component using a single value. Therefore, dynamic performance 
information may carry quality-of-information metrics quantifying its accuracy, distribution, lifetime, 
etc., which may need to be calculated from the raw data. 

• Data gathering and delivery mechanisms must be high-performance. Because dynamic data may 
grow stale quickly, the data management system must minimize the elapsed time associated with 
storage and retrieval. Note that this requirement differentiates the problem of dynamic data management 
from the problem of providing an archival performance record. The elapsed time to read an archive, 
while important, is often not the driving design characteristic for the archival system. We believe that 
archival data is useful both for accounting purposes and for long-term trend analysis. It is our belief, 
however, that separate but complementary systems for managing and archiving Grid performance data, 
respectively, are required, each tailored to meet its own set of unique performance constraints. 

• Performance measurement impact must be minimized. There must be a way for monitoring 
facilities to be able to limit their intrusiveness to an acceptable fraction of the available resources. If no 
mechanism for managing performance monitors is provided, performance measurements may simply 
measure the load introduced by other performance monitors. 

General Implementation Strategies 

A number of the authors of this white paper have built various monitoring systems. The following lessons 
have been learned from this experience, and should be considered when implementing a monitoring system. 

• The data management system must adapt to changing performance conditions dynamically. 
Dynamic performance data is often used to determine whether the shared Grid resources are performing 
well (e.g. fault diagnosis) or whether Grid load will admit a particular application (e.g. resource alloca- 
tion and scheduling). To make an assessment of dynamic performance fluctuation available, the data 
management system cannot, itself, be rendered inoperable or inaccessible by the very fluctuations it 
seeks to capture. As such, the data management system must use the data it gathers to control its own 
execution and resources in the face of dynamically changing conditions. 

• Dynamic data cannot be managed under centralized control. Having a single, centralized repository 
for dynamic data (however short its lifespan) causes two distinct performance problems. The first is that 
the centralized repository for information and/or control represents a single-point-of-failure for the 
entire system. If the monitoring system is to be used to detect network failure, and a network failure 
isolates a centralized controller from separate system components, it will be unable to fulfill its role. All 
components must be able to function when temporarily disconnected or unreachable due to network or 
host failure. For example, a producer must still be able to accept connections from consumers even if 
its connection to sensors or the directory server is down. In addition, once access is restored, producers 
must be able to reconfigure themselves automatically with respect to the rest of the running service 
components. A second problem with centralized data management is that it forms a performance 
bottleneck. For dynamic data, writes often outnumber reads. That is, performance data may be gathered that 
is never read or accessed since demand for the data cannot be predicted. Experience has shown that a 
centralized data repository simply cannot handle the load generated by actively monitored resources at 
Grid scales. 

• All system components must be able to control their intrusiveness on the resources they monitor. 
Different resources experience varying amounts of sensitivity to the load introduced by monitoring. A 
two megabyte disk footprint may be insignificant within a 10 terabyte storage system, but extremely 
significant if implemented for a palm-top or RAM disk. In general, performance monitors and other 
system components must have tunable CPU, communication, memory, and storage requirements. 

• Efficient data formats are critical. In choosing a data format, there are trade offs between ease-of--use 
and compactness. While the easiest and most portable format may be ASCII text including both event 
item descriptions and event item data in each transmission, this also the least compact. This format may 
be suitable for cases where a small amount of data is recorded and transmined infrequently. However, 
some sources of event data can generate huge volumes of data in a short amount of time, demanding 
that a more efficient data format be adopted. Compressed binary representations that can be read on
machines with different byte orders are one possibility. Transmitting only the item data values and using
a data structure obtained separately to interpret the data is another way to reduce the data volume.
XML is an emerging standard that allows the data description to be separated from the data values. The
XML schema could be placed in a separate directory server, retrieved, and used in conjunction with the
event data values. Another possibility is to send the data descriptor one time when a consumer
subscribes to a producer, and send only the data values for each event transmission. The GMA could
support registration of a data format for each event, allowing different events to use the format most
appropriate for their needs. Consumers could be provided plug-in modules to convert from one format 
to another. 
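
As an illustration of the "descriptor once, values afterwards" option, the following minimal Python sketch packs only the item values in network byte order and decodes them with a descriptor that is assumed to have been delivered at subscription time. The descriptor layout, the event name `cpu_load`, and the function names are hypothetical; the GMA does not prescribe any of them.

```python
import struct

# Hypothetical event descriptor, sent once when a consumer subscribes.
# "!" selects network (big-endian) byte order with no padding, so hosts with
# different native byte orders decode the same bytes identically.
DESCRIPTOR = {
    "event": "cpu_load",
    "fields": ["timestamp", "load"],
    "format": "!df",  # one 8-byte double followed by one 4-byte float
}

def encode_values(descriptor, values):
    """Pack only the item values; field names travel separately in the descriptor."""
    return struct.pack(descriptor["format"], *values)

def decode_values(descriptor, payload):
    """Rebuild a name -> value mapping using the previously received descriptor."""
    values = struct.unpack(descriptor["format"], payload)
    return dict(zip(descriptor["fields"], values))

# Each event transmission then carries 12 bytes instead of a self-describing string.
payload = encode_values(DESCRIPTOR, (983318400.0, 0.5))
event = decode_values(DESCRIPTOR, payload)
ascii_form = "event=cpu_load timestamp=983318400.0 load=0.5"
```

The self-describing ASCII form of the same event is several times longer than the packed payload, which is the trade-off between ease of use and compactness discussed above.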

Scalability

One of the biggest issues in defining a monitoring architecture for use in a Grid environment is scalability. 
It is critical that the act of monitoring has minimal effect on the systems being monitored. In this model, one
can add additional producers and additional directory servers as needed, reducing the load where necessary. In
the case where many consumers are requesting the same event data, the use of a producer reduces the amount
of work on and the amount of network traffic from the host being monitored. As such, the resources that a
producer will use must, themselves, be scheduled. A producer might be run on a separate host from the Grid
resources, to ensure that the load from the producer does not affect what is being monitored.

In particular, we believe that the GMA is more scalable than the CORBA Event Service. In the GMA, event 
data is not sent anywhere unless it is requested by a consumer. Many of the current event service systems, 
including CORBA, send all event data to a central component, which consumers then contact. In the GMA, 
only event data subscription information (i.e., which producer to contact) is sent to a central directory server.
Event data goes directly from producer to consumer. We believe this model will scale much better in a Grid
environment. 
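
The separation between directory traffic and event traffic can be sketched as follows. This is an in-memory Python illustration under assumed names (`Directory`, `Producer`, `register`, `lookup`, `subscribe`, `publish`); it is not a GMA API or wire protocol, only a picture of which component sees which data.

```python
class Directory:
    """Central component: stores only which producer offers which event type."""

    def __init__(self):
        self._producers = {}

    def register(self, event_type, producer):
        self._producers.setdefault(event_type, []).append(producer)

    def lookup(self, event_type):
        return list(self._producers.get(event_type, []))


class Producer:
    """Holds subscriptions and sends event data straight to its consumers."""

    def __init__(self, name):
        self.name = name
        self._subscribers = {}

    def subscribe(self, event_type, callback):
        self._subscribers.setdefault(event_type, []).append(callback)

    def publish(self, event_type, value):
        """Deliver to subscribed consumers only; return the number of deliveries."""
        callbacks = self._subscribers.get(event_type, [])
        for callback in callbacks:
            callback(self.name, event_type, value)
        return len(callbacks)


directory = Directory()
producer = Producer("hostA")
directory.register("cpu_load", producer)

received = []
for found in directory.lookup("cpu_load"):
    found.subscribe("cpu_load", lambda name, etype, value: received.append((name, etype, value)))

sent_requested = producer.publish("cpu_load", 0.7)   # one consumer asked for this
sent_unrequested = producer.publish("disk_free", 10)  # nobody asked, nothing is sent
```

In this sketch the directory never handles an event value, and an event type with no subscriber generates no traffic at all.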

In addition, for the GMA system to scale, performance monitoring consumers (particularly those that
require the cooperation between two or more producers) must coordinate their interactions to control
intrusiveness. For example, if network performance is to be monitored between all pairs of hosts attached to a
single Ethernet segment, the network probes required to generate end-to-end measurements cannot occur
simultaneously. If they do, both the quality of the readings that are gathered and the network capacity that is
available for other work will suffer. If performance monitors are not coordinated in the Grid, the intrusiveness
of performance monitoring may strongly impact available performance, particularly as the system scales. That
is, if all performance facilities operate their own monitoring sensors, Grid resources will be consumed by the
monitoring facilities alone. Coordinating a Grid-wide collection of sensors is complicated both by the scale of 
the problem (there are many Grid resource characteristics to monitor) and by the dynamically changing 
performance and availability of Grid resources that are being used to implement the dynamic data 
management service. 
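
One simple way to keep probes on a shared segment from overlapping is to give every host pair its own time slot. The Python sketch below assumes a single coordinator that knows all hosts and a fixed slot length; both are assumptions made for illustration, and the function name is hypothetical. It is not the coordination mechanism of any particular monitoring system.

```python
from itertools import permutations

def serialized_probe_schedule(hosts, slot_seconds):
    """Assign each ordered (source, destination) pair its own start time.

    No two probes share a slot, so measurements on a shared segment
    never run simultaneously.
    """
    schedule = []
    for slot, (src, dst) in enumerate(permutations(hosts, 2)):
        schedule.append((slot * slot_seconds, src, dst))
    return schedule

schedule = serialized_probe_schedule(["a", "b", "c"], slot_seconds=5)
start_times = [start for start, _, _ in schedule]
```

With n hosts the schedule needs n(n-1) slots, which shows why fully serialized all-pairs probing becomes costly as the system grows and why coordination is itself a scalability concern.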

One recommended producer service that is important for system scalability is that of consumer-specified 
caching. Often a consumer needs to access only a small subset of the global data pool, and will sacrifice fast
access for tight data consistency. An automatic program scheduler, for example, might want the "freshest"
data that can be delivered for a specified set of hosts with no more than a one second access delay. To achieve 
this functionality at Grid scales, producers must cache the data the consumer will want and deliver whatever 
data is available at the time of request. Experience with dynamic program scheduling indicates that this type 
of producer is valuable to scalable performance within the Grid [2]. 
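
A producer of this kind can be sketched as a cache that always answers immediately with whatever it currently holds. The Python class below is a hypothetical illustration: the class name, the method names, and the choice to report the age of each reading are assumptions, not part of the GMA.

```python
class CachingProducer:
    """Caches the latest reading per host and answers reads without blocking."""

    def __init__(self):
        self._cache = {}  # host -> (timestamp, value)

    def update(self, host, timestamp, value):
        """Called when a sensor delivers a new reading."""
        self._cache[host] = (timestamp, value)

    def read(self, hosts, now):
        """Return {host: (age_seconds, value)} for the requested hosts.

        Hosts with no cached reading are omitted rather than waited for,
        so the access delay stays bounded.
        """
        result = {}
        for host in hosts:
            if host in self._cache:
                timestamp, value = self._cache[host]
                result[host] = (now - timestamp, value)
        return result

producer = CachingProducer()
producer.update("a", timestamp=10, value=0.2)
producer.update("b", timestamp=12, value=0.9)
snapshot = producer.read(["a", "c"], now=15)
```

Returning the age alongside each value lets a consumer such as a scheduler decide for itself whether a reading is fresh enough to use.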



Security Issues 

A distributed system such as this creates a number of security vulnerabilities which must be analyzed and 
addressed before such a system can be safely deployed on a production Grid. The users of such a system are 
likely to be remote from the machines being monitored and to belong to different organizations. 

Typical user actions will include queries to the directory service concerning event data availability, 
subscriptions to producers to receive event data, and requests to instantiate new event monitors or to adjust 
collection parameters on existing monitors. In each case, the domain that is being monitored is likely to want 
to control which users may perform which actions.

Public key based X.509 identity certificates [6] are a recognized solution for cross-realm identification of
users. When the certificate is presented through a secure protocol such as SSL (Secure Socket Layer), the 
server side can be assured that the connection is indeed to the legitimate user named in the certificate. 

User (consumer) access at each of the points mentioned above (directory lookup and requests to a
producer) would require an identity certificate passed through a secure protocol, e.g. SSL. A wrapper to the
directory server and the producer could both call the same authorization interface with the user's identity and
the name of the resource the user wants to access. This authorization interface could return a list of allowed
actions, or simply deny access if the user is unauthorized. Communication between the producer and the
sensors may also need to be controlled, so that a malicious user cannot communicate directly with the
monitoring process. 
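
The shared authorization interface could look roughly like the Python sketch below. It assumes the caller's identity has already been established from an X.509 certificate over a secure channel; the policy table, the certificate subject strings, the action names, and the function names are all hypothetical.

```python
# Hypothetical policy table: (certificate subject, resource name) -> allowed actions.
POLICY = {
    ("CN=alice,O=SiteA", "directory"): ["lookup"],
    ("CN=alice,O=SiteA", "producer:hostA"): ["subscribe"],
    ("CN=bob,O=SiteB", "producer:hostA"): ["subscribe", "adjust_parameters"],
}

def allowed_actions(identity, resource, policy=POLICY):
    """Return the actions permitted for this identity on this resource.

    An empty list means access is denied.
    """
    return list(policy.get((identity, resource), []))

def guarded_call(identity, resource, action, handler, policy=POLICY):
    """Wrapper usable by both the directory server and a producer."""
    if action not in allowed_actions(identity, resource, policy):
        raise PermissionError(f"{identity} may not {action} on {resource}")
    return handler()

result = guarded_call("CN=alice,O=SiteA", "producer:hostA", "subscribe", lambda: "subscribed")
```

Keeping the decision in one function means the directory wrapper and the producer wrapper enforce the same site policy without duplicating it.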

6.0 Related Work 

There are many existing systems with an event model similar to the one described here. CORBA includes
an "event service" [1] that has a rich set of features, including the ability to push or pull events, and the ability
for the consumer to pass a filter to the event supplier. JINI also has a "Distributed Event Specification" [7],
which is a simple specification for how an object in one Java™ virtual machine (JVM) registers interest in the
occurrence of an event in an object in some other JVM, and then receives a notification when that
event occurs. There are also several other systems with alternative event models, such as the Common
Component Architecture, many of which are summarized in [8]. However, we believe that none of the
existing systems is a perfect match for a Grid monitoring system; therefore we have tried to combine the
relevant strengths of each. Another related system is Autopilot [9], which has had the notion of sensors for
several years, and which implements a similar publish/lookup/subscribe architecture. Note that this list of 
systems is not intended to be exhaustive, but only illustrative of the usefulness of the proposed architecture. 

7.0 Acknowledgements 

Input from many people went into this document, including almost all attendees of the various Grid Forum 
meetings. The LBNL portion of this paper was supported by the U. S. Dept of Energy, Office of Science, 
Office of Computational and Technology Research, Mathematical, Information, and Computational Sciences 
Division, under contract DE-AC03-76SF00098 with the University of California.






February 27, 2001



8.0 References 

[1] CORBA, "Systems Management: Event Management Service", X/Open Document Number: P437.
http://www.opengroup.org/onlinepubs/008356299/

[2] Dail, H., G. Obertelli, F. Berman, R. Wolski, and A. Grimshaw, "Application-Aware Scheduling of a
Magnetohydrodynamics Application in the Legion Metasystem", Proceedings of the 9th Heterogeneous
Computing Workshop, May 2000.

[3] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke, "A Directory Service
for Configuring High-Performance Distributed Computations". In Proceedings 6th IEEE Symposium on
High Performance Distributed Computing, August 1997.

[4] The Globus project: See http://www.globus.org

[5] The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman.
Morgan Kaufmann Pub. August 1998. ISBN 1-55860-475-8.

[6] Housley, R., W. Ford, W. Polk, D. Solo, "Internet X.509 Public Key Infrastructure", IETF RFC 2459. Jan.
1999

[7] "Jini Distributed Event Specification", http://www.sun.com/jini/specs/

[8] Peng, X., "Survey on Event Service", http://www-unix.mcs.anl.gov/~peng/survey.html

[9] Randy L. Ribler, Jeffrey S. Vetter, Huseyin Simitci, and Daniel A. Reed, "Autopilot: Adaptive Control of
Distributed Applications." Proceedings of the 7th IEEE Symposium on High-Performance Distributed
Computing, Chicago, IL, July 1998.

[10] W. Smith, "Monitoring and Fault Management." http://www.nas.nasa.gov/~wwsmith/mon_fm

[11] Tierney, B., B. Crowley, D. Gunter, M. Holding, J. Lee, M. Thompson, "A Monitoring Sensor
Management System for Grid Environments", Proceedings of the IEEE High Performance Distributed Computing
conference (HPDC-9), August 2000. LBNL-45260.

[12] Tierney, B., Lee, J., Crowley, B., Holding, M., Hylton, J., Drake, F., "A Network-Aware Distributed
Storage Cache for Data Intensive Environments", Proceeding of IEEE High Performance Distributed
Computing conference (HPDC-8), August 1999, LBNL-42896. http://www-didc.lbl.gov/DPSS/

[13] Tierney, B., W. Johnston, B. Crowley, G. Hoo, C. Brooks, D. Gunter, "The NetLogger Methodology for
High Performance Distributed Systems Performance Analysis", Proceeding of IEEE High Performance
Distributed Computing conference, July 1998. LBNL-42611. http://www-didc.lbl.gov/NetLogger/

[14] A. Waheed, W. Smith, J. George, J. Yan, "An Infrastructure for Monitoring and Management in
Computational Grids." In Proceedings of the 2000 Conference on Languages, Compilers, and Runtime Systems

[15] Wahl, M., Howes, T., Kille, S., "Lightweight Directory Access Protocol (v3)". Available from
ftp://ftp.isi.edu/in-notes/rfc2251.txt

[16] Wolski, R., Spring, N., Hayes, J., "The Network Weather Service: A Distributed Resource Performance
Forecasting Service for Metacomputing," Future Generation Computing Systems, 1999.
http://nws.npaci.edu/






3LS - A Peer-to-Peer Network Simulator 



Nyik San Ting, Ralph Deters 
Department of Computer Science 

University of Saskatchewan 

Saskatoon, Saskatchewan, 

S7N 5A9 Canada. 
nyt431@mail.usask.ca, deters@cs.usask.ca



Abstract 

Peer-to-Peer (p2p) networks are the latest addition to the 
already large distributed systems family. With a strong 
emphasis on self-organization, decentralization and 
autonomy of the participating nodes, p2p-networks tend to 
be more scalable, robust and adaptive than other forms of 
distributed systems. The much-publicized success of
p2p-networks for file-sharing and cycle-sharing has resulted
in an increased awareness of and interest in p2p
protocols and applications. However, p2p-networks are
difficult to study due to their size and the complex 
interdependencies between users, application, protocol and 
network. This paper presents a 3-level simulator designed 
to study complex p2p networks. 

1. P2P-Network Simulation 

The field of P2P networks is still undergoing major 
changes with new applications and protocols emerging on a 
nearly monthly basis. However, due to the difficulties in 
evaluating them prior to their large-scale deployments, they 
are often short-lived - disappearing as fast as they emerge 
- normally due to bad performance. What seemed to work 
well when using a small number of nodes, high bandwidth, 
low latency, attractive services/content and highly 
cooperative users often fails in real world deployments. 
Testing a system's performance prior to its deployment is a
fairly common element in the software development of 
applications. 

P2P networks tend to be large, heterogeneous systems with 
complex interactions between the physical machines, 
underlying network, application and user. Hence, testing of 
a "running" p2p-network or protocol in a realistic
environment is often not feasible. However, it is possible to
use a simulation of a p2p-network to evaluate the
applications and protocols in a controlled environment.
Researchers who want to simulate a p2p system tend to
avoid the development of a complex simulator and focus
on some selected areas (such as caching schemes). While
some may choose to start an implementation from scratch,
an increasing number of researchers build their simulators 
on top of existing tools (e.g. the agent platform JADE [2]) 
to speed-up the development. The general problem of 
having only special-purpose simulators is that the results 
obtained with one simulator are difficult to validate and 
often impossible to achieve with another simulator due to 
the many hard-coded assumptions of every simulator. 

Figure 1 shows a high level view of the 3LS simulator. 3LS
is a time-stepped simulator that uses a central step-clock
to simulate the timing. In 3LS the models for network,
p2p protocol and user are clearly separated. With the
separation of the network, protocol and application models
from each other, the simulation of various network
topologies for different protocols, applications, and user
models becomes possible. Hence, three levels have been
defined: 

• Network level (bottom), 

• Protocol level (middle) and 

• User level (top). 

Communication can only happen between the directly 
connected levels. The protocol level, which is responsible for
simulating the p2p-protocol and application, serves as
the interface between the user level and the network level.
Input information from the user is fed into the network
level through a GUI or a file. Upon starting the
simulator it is possible to either create the models (fig. 2)
for the three described levels or to choose from a library
the models or combination best suited to the simulation
run. As the simulation is running, the events are displayed
on the command prompt screen. After the simulation has 
been completed, all simulation data is saved into a file for 
future analysis. Though simulation languages provide most 
of the features needed in programming a simulation model 
and the details of the simulation models can be easily 
changed, a general-purpose language was selected to 
provide "greater programming flexibility". Since Java is 
the preferred language of many p2p programmers it was 
chosen as the host-language for the 3LS simulator. 
Visualization of the network is done with the aid of the tool
AiSee [1]. AiSee was selected for its ease of use, simple
installation, availability (runs under various OS), 
functionality and performance in rendering. When 
screenshots of the p2p-network are to be visualized, files 
containing the information of the graph are created by 3LS 
using the Graph Description Language (GDL). Once the 
file is created a user can use AiSee to render an image of 
the graph (see figure 3). 
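
The layering and the central step-clock described above can be sketched in a few lines. 3LS itself is written in Java and its code is not reproduced here; the following Python sketch uses invented class names, a fixed network latency, and a scripted user purely to illustrate how a time-stepped loop drives three levels that only talk to their direct neighbours.

```python
class NetworkLevel:
    """Bottom level: delivers messages after a fixed latency, counted in steps."""

    def __init__(self, latency_steps):
        self.latency_steps = latency_steps
        self._in_flight = []  # list of (deliver_at_step, message)

    def send(self, step, message):
        self._in_flight.append((step + self.latency_steps, message))

    def deliver(self, step):
        due = [msg for at, msg in self._in_flight if at <= step]
        self._in_flight = [(at, msg) for at, msg in self._in_flight if at > step]
        return due


class ProtocolLevel:
    """Middle level: the only level connected to both of the others."""

    def __init__(self, network):
        self.network = network
        self.received = []

    def request(self, step, query):
        self.network.send(step, {"type": "QUERY", "payload": query})

    def tick(self, step):
        self.received.extend(self.network.deliver(step))


class UserLevel:
    """Top level: issues scripted requests to the protocol level only."""

    def __init__(self, protocol, script):
        self.protocol = protocol
        self.script = script  # step number -> query string

    def tick(self, step):
        if step in self.script:
            self.protocol.request(step, self.script[step])


def run(steps, latency_steps, script):
    network = NetworkLevel(latency_steps)
    protocol = ProtocolLevel(network)
    user = UserLevel(protocol, script)
    log = []
    for step in range(steps):  # the central step-clock
        user.tick(step)
        protocol.tick(step)
        log.append((step, len(protocol.received)))
    return log


log = run(steps=5, latency_steps=2, script={0: "song.mp3"})
```

Because each level is a separate object, a different network model or user model could be substituted without touching the other two, which is the point of the separation.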



Figure 1: Architecture of the 3LS

Figure 2: Screenshot of 3LS - Network-Model

2. Future Work

Future work focuses on collecting data for the various
layers, e.g. human desktop usage and network traffic. We
are currently testing the simulator by comparing its results
for a Gnutella 0.4 network [3] with the "real data" obtained
from running Gnutella 0.4 clients in a controlled network.
Using Comtella [4] clients we are able to adjust the various
parameters of the simulation and verify the simulation
results. Early results in a small network (less than 20
nodes) indicate that the simulator works as expected but
more testing is needed.



3. Code

The complete code of the 3LS simulator is available 
upon request by sending an email to one of the authors. 
3LS requires JDK 1.3.1 or higher.

Figure 3: Example of network view using AiSee.



Grid Organization



CROSS-REFERENCE TO RELATED APPLICATIONS 

This application incorporates by reference the content of U.S. Provisional Application 
No. 60/490,818, Express Mail Number EV 331001684 US, filed July 28, 2003, to Erol Bozak et
al., entitled GRID COMPUTING MANAGEMENT.

TECHNICAL FIELD 

The present invention relates to data processing by digital computer, and more 
particularly to a dynamic tree structure for grid computing. 



BACKGROUND

In today's data centers, the clusters of servers in a client-server network that run business
applications often do a poor job of managing unpredictable workloads. One server may sit idle,
while another is constrained. This leads to a "Catch-22" where companies, needing to avoid
network bottlenecks and safeguard connectivity with customers, business partners and
employees, often plan for the highest spikes in workload demand, then watch as those surplus
servers operate well under capacity most of the time.

In grid computing, all of the disparate computers and systems in an organization or
among organizations become one large, integrated computing system. That single integrated
system can then handle problems and processes too large and intensive for any single computer
to easily handle in an efficient manner.

More specifically, grid computing is a form of distributed system wherein computing
resources are shared across networks. Grid computing enables the selection, aggregation, and
sharing of information resources resident in multiple administrative domains and across
geographic areas. These information resources are shared, for example, based upon their
availability, capability, and cost, as well as a user's quality of service (QoS) requirements. Grid
computing can mean reduced cost of ownership, aggregated and improved efficiency of
computing, data, and storage resources, and enablement of virtual organizations for applications
and data sharing.

SUMMARY

In one aspect, the invention features a method that includes, in a client server network, 
maintaining systems having grid managers having hierarchical relations, the relations of each 
grid manager stored in each of the systems. 

Embodiments may include the following. Each of the hierarchical relations is classified
as superior or inferior.

In another aspect, the invention features a system that includes a network of computer
systems, each of the computer systems including a grid management engine, each of the grid
managers having hierarchical relations with other grid managers, the relations of each grid
manager stored in each of the systems.

Embodiments may include the following. Each of the relations is classified as superior
or inferior.

In another aspect, the invention features a method that includes, in a network, starting an 
execution of a first service on a first computer, the first service handling at least locating, 
reserving, allocating, monitoring, and deallocating one or more computational resources for one 
or more applications using the network. The method further includes reading, by the first 
service, a file to inform the first service of a relation with a second service, wherein the first 
service has an inferior relation with the second service, the inferior relation meaning that the
second service can send a query for available computer resources to the first service. The
method further includes establishing a first communication channel from the first service to the 
second service, and accepting an opening of a second communication channel from the second 
service to the first service. 

Embodiments may include one or more of the following. The method includes receiving 
a message to cancel the first service's inferior relation with the second service, closing the first 
and second communication channels, receiving a message to generate an inferior relation from the
first service to a third service residing in a third computer, establishing a third communication
channel from the second service to the third service, and accepting an opening of a fourth
communication channel from the third service to the first service. In some cases, establishing a
first communication channel further includes determining if the second service responds to
determining and if not, establishing a communication channel to the second service after a
predetermined time period.

In another aspect, the invention features a method that includes, in a network, starting an
execution of a first service residing in a first computer, the first service handling at least locating,
allocating, monitoring, and deallocating one or more computational resources for one or more
applications using the network, starting an execution of a second service residing in a second
computer, and reading, by the second service, a file to inform the second service of a relation
with the first service, wherein the second service has an inferior relation with the first service,
wherein the inferior relation indicates that the first service can send a query for available
computer resources to the second service. The method further includes establishing a first
communication channel from the second service to the first service, and establishing a second
communication channel from the first service to the second service.

Embodiments may include the following. The method further includes receiving, by
the second service, a message to cancel the second service's relation with the first service,
closing the first communication channel, failing to respond to the second communication
channel, receiving a message to create an inferior relation from the second service to a third
service, establishing a third communication channel from the second service to the third service,
and establishing a fourth communication channel from the second service to the third service.

In another aspect, the invention features a system that includes two or more computers
each configured to run a service, the service handling at least locating, allocating, monitoring,
and deallocating one or more computational resources for one or more applications. The system
also includes a network of the services, the network configured such that a first service from the
services has a superior relation with a second service from the services and the second service
has an inferior relation with the first service, wherein the first service is configured to check the
status of the second service in the network by waiting for a response to a query from the first
service to the second service.

Embodiments may include the following. The relation includes a first communication
channel from the first service to the second service and a second communication channel from
the second service to the first service. The first service is further configured to locate the one or
more computational resources for the one or more applications by sending a query for available
computational resources to the second service. The second service is further configured to
remove its inferior relation with the first service and create a new superior relation with a third
service.

These and other embodiments may have the following advantage. A fast and robust grid
computing environment can be achieved using the dynamic tree structure for grid management.

The details of one or more embodiments of the invention are set forth in the
accompanying drawings and the description below. Other features, objects, and advantages of
the invention will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS 

FIG. 1 is a block diagram of a grid computing environment.

FIG. 2 is a flow diagram for discovering and reserving resources in the grid computing
environment of FIG. 1.

FIG. 3 is a flow diagram for installing, running, and removing applications in the grid
computing environment of FIG. 1.

FIG. 4 is a block diagram of a computer device in the grid computing environment of
FIG. 1.

FIG. 4A is a flow diagram for starting up an application in the computer device of FIG. 4.

FIG. 5 is a flow diagram for starting up grid managers in the grid computing environment
of FIG. 1.

FIG. 5A is a block diagram of the grid computing environment of FIG. 1 that is
augmented with another computer device.

FIG. 6 is a block diagram of an exemplary grid graphical user interface (GUI)
component for visualization of a grid computing environment.

FIG. 7 is a block diagram of a grid browser component.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION 

As shown in FIG. 1, services in a grid computing environment 100 manage
computational resources for applications. The grid computing environment 100 is a set of
distributed computing resources that can individually be assigned to perform computing or data
retrieval tasks for the applications. The computational resources include computer devices 12,
14, 16, 18, 20, and 22. The computer devices communicate using a network 8. The applications
have scalable computational requirements. For example, an example application that uses
computer devices 12, 14, 16, 18, 20, and 22 in the grid computing environment 100 is an internet
pricing configurator. The computer device 12 provides network access to pricing information to
users via web browsers on computer devices that are connected to the internet. The web
browsers can be any application able to display content and/or execute applications such as web
pages, media files, and programs, such as Netscape Navigator®, Microsoft Internet Explorer®,
and similar applications. 

In this example, a web server on computer device 12 provides pricing information to the 
users. Calculation parameters for each price to be calculated are passed by an IPC dispatcher 
116 to IPC servers 120, 122, 124, and 126 that execute on computer devices 12, 14, 16, and 18, 
respectively. Due to the flexibility of the web server and applications on the internet, the number
of users can vary. This generates dynamic computational requirements for the internet pricing
configurator. An IPC manager 118 communicates with services in the grid computing
environment 100 so that the services can allocate and deallocate computational resources (e.g.,
processors in computer devices 12, 14, 16, 18, 20, 22) based on the dynamic computational
requirements of the internet pricing configurator. Allocating and deallocating computational
resources in this manner allows computer devices 12, 14, 16, 18, 20, or 22 to be designated as 
general-purpose computational resources and not solely dedicated to handling peak demands of 
the internet pricing configurator application. The IPC manager 118 coordinates with the IPC 
dispatcher 116 so that the IPC dispatcher 116 has access to resources in network 8. 

This capability to allocate and deallocate the resources in the grid computing 
environment 100 enables the IPC manager 118 to locate and use available computational 
resources on an "as needed" basis. Once resources are located, the IPC manager 118 can use 
services in the grid computing environment 100 to install the IPC servers 120, 122, 124, and 126 
as applications on computer devices in the grid computing environment 100. The IPC dispatcher 
116 uses Web Service Definition Language (WSDL) interfaces defined in the Open Grid 
Services Infrastructure (OGSI) Version 1.0 by Tuecke et al. to manage and exchange the
information flow between the IPC dispatcher 116 and IPC servers 120, 122, 124, and 126. For
example, the OGSI WSDL interfaces can be used to pass computation parameters for pricing
calculations from the IPC dispatcher 116 and the IPC servers 120, 122, 124, and 126. The OGSI
WSDL interfaces can also be used to pass completed results from the IPC servers 120, 122, 124,
and 126 back to IPC dispatcher 116. The OGSI Version 1.0 is incorporated herein by reference.
The OGSI WSDL interfaces enable the controlled, fault-resilient, and secure management of the
grid computing environment 100 and applications such as the internet pricing configurator. 

While the IPC dispatcher 116 uses IPC servers 120, 122, 124, and 126 to perform 
calculations for users, services in the grid computing environment 100 monitor resource 
utilization on computer devices in the grid computing environment 100 running the IPC servers
120, 122, 124, and 126. The services also send this utilization information to the IPC manager
118. Based on a comparison between utilization requirements and current resource loading, the
IPC manager 118 can dynamically inform services in the grid computing environment 100 to
allocate more resources for IPC servers 120, 122, 124, and 126 or deallocate resources to keep
utilization of resources in the grid computing environment 100 at a desired level. 

Grid managers 152, 154, 156, 160, 162, and 164 are resident in computer devices 12, 14, 
16, 18, 20, and 22, respectively. Within the grid computing environment 100, pairs of grid
managers can have directional relations that classify one grid manager as superior to another grid
manager. A grid manager can have more than one superior relation with other grid managers.
For example, grid manager 152 has a superior relation with grid managers 154 and 156. A grid
manager can also have more than one inferior relation with other grid managers. Through these
hierarchical relations, IPC manager 118 does not need access to a list of all computer devices in
network 8 to use the computational resources in the grid computing environment 100. IPC 
manager 118 is only required to have access to a network address of one computer device 
running a grid manager (e.g., computer device 12 running grid manager 152) and this grid 
manager uses its relations with other grid managers running on other computer devices to 
provide IPC dispatcher 116 with indirect access to other computer devices in the grid computing 
environment 100. 

A grid manager (e.g., 152, 154, 156, 160, 162, and 164) maintains a first list of all 
superior relations with other grid managers and a second list of all inferior relations with other
grid managers. Each grid manager maintains an "always open" communications channel to all
the grid managers in these lists over network 8 using, for example, the aforementioned OGSI 
WSDL interfaces on transmission control protocol (TCP), hypertext transfer protocol (HTTP), 
and simple object access protocol (SOAP). These lists and corresponding communication 
channels can be modified, allowing a dynamic reconfiguration of the grid hierarchy during 
runtime. This also allows a failing grid manager to be dynamically replaced in the hierarchy.
For example, referring to FIG. 1, if grid manager 154 fails, then grid manager 152 loses its
connection to grid managers 160 and 162. In this case, relations between grid managers can be
modified so that grid manager 152 has new superior relations to grid managers 160 and 162.
Likewise, grid managers 160 and 162 have new inferior relations to grid manager 152. 
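
The two relation lists and the reconfiguration after a failure can be illustrated with a small in-memory sketch. The Python code below is not the patented implementation: the class and function names are hypothetical, communication channels are omitted, and the relation changes that the description performs through messages are applied directly to the lists. Only the managers and relations named in the text (152 over 154 and 156, 154 over 160 and 162) are modelled.

```python
class GridManager:
    """Keeps one list per relation direction, as described above."""

    def __init__(self, name):
        self.name = name
        self.superior_to = []  # managers over which this manager is superior
        self.inferior_to = []  # managers to which this manager is inferior


def relate(superior, inferior):
    """Record a directional relation on both sides."""
    superior.superior_to.append(inferior)
    inferior.inferior_to.append(superior)


def remove_failed(failed):
    """Drop a failed manager and reattach its inferiors to its superiors."""
    parents = list(failed.inferior_to)
    children = list(failed.superior_to)
    for parent in parents:
        parent.superior_to.remove(failed)
    for child in children:
        child.inferior_to.remove(failed)
    for parent in parents:
        for child in children:
            relate(parent, child)
    failed.inferior_to.clear()
    failed.superior_to.clear()


managers = {n: GridManager(n) for n in ("152", "154", "156", "160", "162")}
relate(managers["152"], managers["154"])
relate(managers["152"], managers["156"])
relate(managers["154"], managers["160"])
relate(managers["154"], managers["162"])

remove_failed(managers["154"])
now_under_152 = [m.name for m in managers["152"].superior_to]
```

After the failure is handled, grid manager 152 holds superior relations to 160 and 162 and each of them holds an inferior relation to 152, matching the example in the text.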



As shown in FIG. 2, an application start process 200 is designed so applications (e.g., 
internet pricing configurator) get necessary resources allocated in the network 8 before executing 
on a computer device (e.g., 12, 14, 16, 18, 20, or 22). Process 200 also guarantees that, if similar 
applications are trying to start at the same time on the same resource on a computer device, 
the two or more applications do not collide or interfere with each other. For example, the IPC 
manager 118 can require that an IPC server (e.g., 120) be the only application executing on a 
processor in computer device 14 for quality of service (QoS). In this case, another application 
would interfere if the other application simultaneously attempted to execute on the processor in 
computer device 14. 

Process 200 includes IPC manager 118 (or some other application) sending (202) 
requirements for computational resources to query a grid manager (e.g., 154) to determine if 
there are resources matching these requirements available in the grid computing environment 
100. These requirements specify information pertaining to resources in a computer device such 
as required number of processors, required percentage of utilization for those processors, main 
memory, and network speed. The query can also include information to which hierarchy level 
(in the grid computing environment 100) the query should be propagated. Process 200 includes 
grid manager 154 receiving (204) the requirements. 

To respond to the query for available resources from IPC manager 118, process 200 
includes grid manager 154 matching (206) the requirements against resources known to grid 
manager 154. These resources include resources (e.g., processor 40) in computer device 14 that 
are directly managed by grid manager 154. Resources directly managed by grid manager 154 
that are currently available and meet the requirements are added to a resource-query list 
maintained by grid manager 154. 

Grid manager 154 also sends the query to grid managers 160 and 162 having inferior 
relations with grid manager 154. Process 200 includes grid managers 160 and 162 responding 
(208) to the query by sending to grid manager 154 lists of resources (e.g., processors on 
computer devices 18, 20) that meet the requested requirements and are available and known to 
grid managers 160 and 162, respectively. These resource-query lists of resources that are known 
to grid managers 160 and 162 can also include resources managed by grid managers (not shown) 
with inferior relations to grid managers 160 and 162. Grid manager 154 adds these 
resource-query lists of available resources from grid managers 160 and 162 to its resource-query 
list of available resources meeting the requested requirements. If process 200 determines (210) that 
there is at least one resource (e.g., processor 40) in this resource-query list, then grid manager 
154 sends (214) this resource-query list to IPC manager 118. Otherwise, if process 200 
determines (212) that grid manager 154 has a relation with a superior grid manager (e.g., grid 
manager 152), grid manager 154 sends (202) the query for available resources to grid manager 
152. In response to this query, grid manager 152 does not send a redundant query back to grid 
manager 154 having an inferior relation with grid manager 152. 
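
The matching and propagation steps (206) through (212) can be sketched as a recursive query. This is a minimal illustration under stated assumptions: requirements are reduced to a processor count and a memory size, each manager has at most one superior, and the `Resource` and `Manager` classes and the `asked_by` parameter are hypothetical. The optional hierarchy-level limit mentioned above is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    owner: str          # grid manager that directly manages the resource
    processors: int
    memory_mb: int
    available: bool = True


class Manager:
    """Hypothetical grid manager holding resources and hierarchy links."""

    def __init__(self, name):
        self.name = name
        self.resources = []
        self.inferiors = []
        self.superior = None

    def add_inferior(self, other):
        self.inferiors.append(other)
        other.superior = self

    def query(self, processors, memory_mb, asked_by=None):
        """Return available resources that meet the requirements."""
        # Step 206: match against directly managed resources.
        found = [r for r in self.resources
                 if r.available
                 and r.processors >= processors
                 and r.memory_mb >= memory_mb]
        # Step 208: collect lists from inferior managers, skipping the caller
        # so that a superior never sends a redundant query back down.
        for inf in self.inferiors:
            if inf is not asked_by:
                found += inf.query(processors, memory_mb, asked_by=self)
        # Steps 210 and 212: escalate only when nothing was found below.
        if not found and self.superior is not None and self.superior is not asked_by:
            found = self.superior.query(processors, memory_mb, asked_by=self)
        return found


m152, m154, m156, m160 = (Manager(n) for n in ("152", "154", "156", "160"))
m152.add_inferior(m154)
m152.add_inferior(m156)
m154.add_inferior(m160)
m160.resources.append(Resource("160", processors=2, memory_mb=2048))
m156.resources.append(Resource("156", processors=8, memory_mb=16384))

small = [r.owner for r in m154.query(processors=2, memory_mb=1024)]
large = [r.owner for r in m154.query(processors=4, memory_mb=1024)]
print(small)  # ['160']
print(large)  # ['156']
```

The first query is satisfied below grid manager 154, so nothing is escalated. The second finds nothing below 154 and is forwarded to 152, which asks its other inferior but does not query 154 again.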

Process 200 includes grid manager 154 sending (214) the list of available resources along 
with addresses of their corresponding grid managers in the network 8 that match the 
requirements. The IPC manager 118 selects a resource (e.g., on computer device 16) from the 
list and requests (216) a reservation of the resource on computer device 16 to the grid manager 
154 managing the resource on computer device 16. If the resource in computer device 16 is still 
available for reservation (218) and the reservation succeeds, grid manager 154 sends (220) a 
reservation number to the IPC manager 118. This reservation means that the IPC manager 118 is 
guaranteed and allocated the requested resource on the computer device 16 in the grid computing 
environment 100. The grid manager 154 handles queries for available resources from 
applications such as IPC manager 118 using independent processing threads of execution. Thus, 
the grid manager 154 uses a semaphore to ensure that the same resource (e.g., processor 40) is 
not assigned multiple reservation numbers for different applications simultaneously requesting 
the same resource. 
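
The role of the semaphore can be shown with a small sketch in which several threads request the same resource at once. The `ReservationTable` class and the numbering scheme are hypothetical; the only point illustrated is that the availability check and the assignment of a reservation number happen inside one guarded section.

```python
import itertools
import threading


class ReservationTable:
    """Hypothetical reservation bookkeeping for one grid manager."""

    def __init__(self, resources):
        self._free = set(resources)
        self._reserved = {}                    # reservation number -> resource
        self._numbers = itertools.count(1)
        self._guard = threading.Semaphore(1)   # binary semaphore

    def reserve(self, resource):
        """Return a reservation number, or None if the resource is taken."""
        with self._guard:
            if resource not in self._free:
                return None
            self._free.remove(resource)
            number = next(self._numbers)
            self._reserved[number] = resource
            return number


table = ReservationTable(["processor 40"])
results = []
threads = [
    threading.Thread(target=lambda: results.append(table.reserve("processor 40")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = [r for r in results if r is not None]
print(len(granted))  # 1
```

Without the guard, two threads could both observe the resource as free and each receive a reservation number for it.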

If the grid manager determines that the requested resource in computer device 16 is not 
available for reservation and the reservation fails, the IPC manager 118 selects the next available 
resource in the list and requests (216) the reservation of this next available resource. If the IPC 
manager 118 receives a registration number and a timeout measured from the sending of the 
registration number does not expire (222), the IPC manager 118 starts (224) the IPC server 122 
on the processor 40 resource in computer device 16. Starting the IPC server 122 is initiated by 
passing the reservation number and an application file to the grid manager 156 and then grid 
manager 156 reads the application file to install and execute the IPC server 122 on computer 
device 16. 

As shown in FIG. 3, process 250 installs an application (e.g., IPC server 122) on a 
computer device (e.g., 14) in the grid computing environment 100 to set up an available resource 
for the application, using the available resource, and removing or deinstalling the application to 
free up the resource for use by subsequent applications when the resource is no longer needed. 
Process 250 includes IPC manager 118 transferring (252) an application file containing code for 
IPC server 122 in addition to instructions on how to install, customize, track and remove the 
application from computer device 14 so that the grid manager 154 can return computer device 14 
to an original state after executing the application. 

IPC manager 118 transfers the application file using a file transfer protocol (FTP), 
hypertext transfer protocol (HTTP), or a file copy from a network attached storage (NAS) for 
example, to computer device 14 as a single file, such as a compressed zip file. Within this zip 
file there is information about installing and customizing the application IPC server 122. This 
information is represented by a small executable program or extensible markup language (XML) 
document that is extracted and interpreted (254) by an installation and customizing engine (not 
shown) in grid manager 154. Process 250 includes grid manager 154 installing (256) and 
running (258) the application. During installation (256), customization and execution (258) of 
the application, all changes to the computer device 14 are logged so that when the application is 
terminated (260) or deinstalled by grid manager 154 upon request by IPC manager 118, grid 
manager 154 removes the application from the computer device 14 and also removes (262) any 
other changes to computer device 14 that were done when installing and running the application. 
Thus, the computer device 14 reverts to its original state prior to execution of the application and 
all of the resources of computer device 14 are again available for use by a subsequent 
application. This allows the resources to become available after running the application without 
rebooting computer device 14. These changes include space in memory (e.g., 32) allocated to 
store and run application code in addition to other changes such as allocation of communication 
ports. 
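
The logging of changes during installation and their later removal can be sketched as an undo log. The sketch assumes a simulated device represented by a dictionary; the `ChangeLog` class, the `install` function, and the file names and port number are hypothetical and stand in for whatever the installation and customizing engine actually records.

```python
class ChangeLog:
    """Undo log: every change is recorded with the action that reverses it."""

    def __init__(self):
        self._entries = []

    def record(self, description, undo):
        self._entries.append((description, undo))

    def rollback(self):
        """Reverse all recorded changes, most recent first."""
        undone = []
        while self._entries:
            description, undo = self._entries.pop()
            undo()
            undone.append(description)
        return undone


def install(device, log, files, port):
    """Apply changes to a simulated device and log how to reverse each one."""
    for name in files:
        device["files"].add(name)
        log.record(f"file {name}", lambda name=name: device["files"].discard(name))
    device["ports"].add(port)
    log.record(f"port {port}", lambda: device["ports"].discard(port))


device = {"files": set(), "ports": set()}
log = ChangeLog()
install(device, log, ["ipc_server.bin", "ipc_server.cfg"], 8080)
undone = log.rollback()
print(undone)  # ['port 8080', 'file ipc_server.cfg', 'file ipc_server.bin']
print(device)  # {'files': set(), 'ports': set()}
```

Reversing the entries in the opposite order of their creation is a design choice of the sketch; it keeps later changes from depending on earlier ones that have already been removed.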

In some examples, multiple applications can simultaneously run on resources in a single 
computer device (e.g., 14). Applications for the grid computing environment 100 are classified 
in part based on their resource requirements. Some changes to a computer device to run an 
application are only required for the first execution of an application of its class and subsequent 
executions do not require these changes. In these examples, grid manager 154 only does the 
changes for the first execution. Furthermore, when deinstalling the applications, grid manager 
154 only removes the changes for the last application that was executed and terminated. 
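
One way to read this behavior is as reference counting per application class: the class-level changes are applied when the count rises from zero to one and removed when it falls back to zero. The sketch below follows that reading, which is an assumption; the `ClassSetup` name and its methods are hypothetical.

```python
from collections import Counter


class ClassSetup:
    """Hypothetical per-class bookkeeping of shared device changes."""

    def __init__(self):
        self._installed = Counter()   # application class -> installed count
        self.applied = set()          # classes whose changes are in place

    def on_install(self, app_class):
        """Return True if this installation had to apply the class changes."""
        self._installed[app_class] += 1
        if self._installed[app_class] == 1:
            self.applied.add(app_class)
            return True
        return False

    def on_deinstall(self, app_class):
        """Return True if this deinstallation removed the class changes."""
        if self._installed[app_class] == 0:
            raise ValueError(f"no installed application of class {app_class}")
        self._installed[app_class] -= 1
        if self._installed[app_class] == 0:
            self.applied.discard(app_class)
            return True
        return False


setup = ClassSetup()
starts = [setup.on_install("ipc-server"), setup.on_install("ipc-server")]
stops = [setup.on_deinstall("ipc-server"), setup.on_deinstall("ipc-server")]
print(starts, stops)  # [True, False] [False, True]
```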

After installing applications on computer devices in the grid computing environment 100, 
grid managers are configured to start or stop the processes of these applications upon request. In 
the example of the internet pricing configurator (IPC) application, grid manager 154 is 
configured to start or stop IPC server 122 on computer device 14 after installing IPC server 122 
on computer device 14. The IPC manager 118 requests grid managers to start or stop IPC 
servers in the grid computing environment 100 based on current utilization of resources in the 
grid computing environment 100. After stopping IPC server 122 on computer device 14, IPC 
manager 118 waits a prespecified amount of time and then requests grid manager 154 to deinstall 
IPC server 122 if current resource utilization does not indicate a need to start IPC server 122 
again. Furthermore, as mentioned previously, grid managers monitor resource utilization on 
computer devices such as computer device 14 running applications (e.g., IPC servers 120, 122, 
124, and 126) and send this utilization information to IPC manager 118. 

In many examples, control of application processes on resources in a computer device is 
specific to the operating system (OS). The grid computing environment 100 is configured to 
handle different operating systems on computer devices. Furthermore, grid computing 
environment 100 is designed to handle different applications (e.g., internet pricing configurator) 
that do not have to be redesigned to execute on the grid computing environment 100. A grid 
manager controls an application process in a general manner that decreases interdependence 
between development of grid manager code and application code. An interface is provided to 
application code to enable grid managers to discover, control (e.g., start, stop, halt, resume) and 
inspect or monitor a state of application processes. The interface is provided for operating 
system processes that are exposed by the operating system or hosting environment and includes 
three aspects. One aspect of the interface is process data, such as process identification, states, 
degree of resource consumption (such as Central Processing Unit (CPU), memory, socket 
bindings, or other resources that an application can use), and application specific data defined by 
a process data scheme. 

A second aspect of the interface is managing operations, such as start, stop, wait, resume, 
change priority, and other operations defined by supported managing operations. 

A third aspect of the interface is control bindings and definitions, such as process data 
scheme, supported managing operations, and communication bindings. Since not all applications 
running in the grid computing environment 100 have access to the same information and 
capabilities in these three aspects, the applications provide to grid managers a list of queries and 
commands that each application supports. 
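
The three aspects can be pictured with a small sketch: a record of process data, a set of managing operations, and a list of the operations a given application actually supports. All names here (`ProcessData`, `ManagedProcess`, the operation names and the process identifier) are hypothetical, and the operations only change a state field rather than controlling a real operating system process.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessData:
    """First aspect: data a grid manager can inspect."""
    pid: int
    state: str = "stopped"
    cpu_percent: float = 0.0
    memory_mb: float = 0.0
    extra: dict = field(default_factory=dict)   # application specific data


class ManagedProcess:
    """Second and third aspects: operations and the list of supported ones."""

    def __init__(self, pid, handlers):
        self.data = ProcessData(pid)
        self._handlers = handlers     # operation name -> callable(ProcessData)

    def supported(self):
        """List of commands this application declares to the grid manager."""
        return sorted(self._handlers)

    def invoke(self, operation):
        if operation not in self._handlers:
            raise NotImplementedError(operation)
        self._handlers[operation](self.data)
        return self.data.state


def _start(data):
    data.state = "running"


def _stop(data):
    data.state = "stopped"


proc = ManagedProcess(4711, {"start": _start, "stop": _stop})
print(proc.supported())      # ['start', 'stop']
print(proc.invoke("start"))  # running
```

A grid manager written against such an interface can check `supported()` before sending a command, so applications with fewer capabilities need no special handling in the manager code.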

The interface provided to application code is an Application Program Interface (API). 
The API is a set of methods (embedded in software code) prescribed by the grid manager 
software by which a programmer writing an application program (e.g., internet pricing 
configurator) can handle requests from the grid manager. 

As shown in FIG. 4, IPC server 122 includes an API 302 and a document 304. Since the 
API 302 is adapted to different types of applications, the document 304 describes how grid 
manager 154 communicates with the IPC server 122 and what requests through the API 302 are 
supported by the IPC server 122. Grid manager 154 reads document 304 before starting up IPC 
server 122. In some examples, document 304 is written in XML and includes a Document Type 
Definition (DTD) 306. A DTD is a specific definition that follows the rules of the Standard 
Generalized Markup Language (SGML). A DTD is a specification that accompanies a document 
and identifies what the markups are that separate paragraphs, identify topic headings, and how 
each markup is to be processed. By including the DTD 306 with document 304, grid manager 
154 having a DTD "reader" (or "SGML compiler") is able to process the document 304 and can 
correctly interpret many different kinds of documents 304 that use a range of different markup 
codes and related meanings. 
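
As an illustration of how a grid manager might read such a document, the sketch below parses a small XML text and extracts communication parameters and the supported requests. The element and attribute names, the host, and the port are invented for the example, since the layout of document 304 is not specified here. The sketch uses Python's `xml.etree.ElementTree`, which does not validate against a DTD, so the DTD step is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical content for a document describing one application.
DOCUMENT = """\
<application name="IPC server">
  <communication protocol="http" host="localhost" port="8080"/>
  <requests>
    <request name="Start"/>
    <request name="Stop"/>
    <request name="Are you idle?"/>
  </requests>
</application>
"""


def read_document(text):
    """Return (communication parameters, supported request names)."""
    root = ET.fromstring(text)
    params = dict(root.find("communication").attrib)
    params["port"] = int(params["port"])
    requests = [r.get("name") for r in root.iter("request")]
    return params, requests


params, requests = read_document(DOCUMENT)
print(params)    # {'protocol': 'http', 'host': 'localhost', 'port': 8080}
print(requests)  # ['Start', 'Stop', 'Are you idle?']
```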

As shown in FIG. 4A, grid manager 154 uses process 350 to install applications such as 
IPC server 122. Grid manager 154 reads (352) DTD 306 in document 304 to identify markups in 
document 304. Grid manager 154 reads (354) document 304 using markups to identify 
communication parameters for communicating with IPC server 122. Grid manager 154 sets up 
(356) communications with IPC server 122 based on the specifications of the communication 
parameters. Grid manager 154 communicates (358) with IPC server 122 using the 
communication parameters to send requests such as "Start", "Stop", and "Are you idle?". 

Before any applications (e.g., internet pricing configurator) can be executed on network 
8, grid managers 152, 154, 156, 160, 162, and 164 are asynchronously started up on computer 
devices 12, 14, 16, 18, 20, and 22, and relations to other grid managers are established. As 
shown in FIG. 5, process 400 initializes relations among grid managers. For each grid manager 
(e.g., grid manager 154), the grid manager 154 starts up on computer device 14 by reading (402) 
a properties file. The properties file contains a list of addresses of computer devices with grid 
managers having superior relations to grid manager 154. This list was described earlier as a first 
list of all superior relations with other grid managers. If (404) a superior grid manager (e.g., grid 
manager 152) is specified in this list of addresses, grid manager 154 requests (406) to open a 
communication channel to the superior grid manager (e.g., 152). If grid manager 152 is already 
started, then grid manager 152 responds by accepting the request of the opening of the 
communication channel from grid manager 152. Process 400 includes grid manager 154 
detecting (408) any requests for communication channels from grid managers (e.g., grid 
managers 160, 162) identified as having inferior relations with grid manager 154. If process 400 
determines (410) that there are some requests, grid manager 154 allows communication channels 
from the inferior grid managers (e.g., 160, 162). Process 400 includes grid manager 154 
checking (414) if there are any pending requests for communication to grid managers having 
superior relations. If there are any pending requests, grid manager 154 requests (406) 
communication channels to grid managers. These communication channels are used for resource 
queries between grid managers (as described previously) and "heart beat" messages between grid 
managers to ensure that each grid manager in the grid computing environment 100 is 
functioning. 
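
Because grid managers start asynchronously, a superior may not yet be running when an inferior first requests a channel. The sketch below illustrates steps (402), (406) and (414) under assumptions: the properties file uses a `key = value` syntax with a `superiors` key, addresses are plain strings, and the connection attempt is passed in as a function. None of these details are specified in the text, and a real implementation would wait between retry rounds.

```python
def read_superiors(properties_text, key="superiors"):
    """Step 402: parse a key=value properties text for superior addresses."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return [a.strip() for a in value.split(",") if a.strip()]
    return []


def connect_to_superiors(addresses, try_connect, max_rounds=3):
    """Steps 406 and 414: request channels and retry the pending ones."""
    pending = list(addresses)
    opened = []
    for _ in range(max_rounds):
        if not pending:
            break
        still_pending = []
        for address in pending:
            if try_connect(address):
                opened.append(address)
            else:
                still_pending.append(address)   # superior not started yet
        pending = still_pending
    return opened, pending


# Hypothetical properties file for one grid manager.
PROPERTIES = """\
# addresses of grid managers with superior relations
superiors = host12:7000
"""

attempts = {"host12:7000": 0}


def fake_connect(address):
    """Simulated superior that comes up before the second attempt."""
    attempts[address] += 1
    return attempts[address] >= 2


opened, pending = connect_to_superiors(read_superiors(PROPERTIES), fake_connect)
print(opened, pending)  # ['host12:7000'] []
```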

Once grid managers 152, 154, 156, 160, 162, and 164 are running with established 
relations, the grid managers are used for the proper operation of the grid computing environment 
100. Often during the lifecycle of the grid computing environment 100 the functionality of the 
grid managers is enhanced. It is often not possible or convenient to shut down the grid 
computing environment 100 and start the grid computing environment 100 up with the 
enhancements. Grid managers 152, 154, 156, 160, 162, and 164 are configured so that there is 
only a minimal impact on users of the grid computing environment 100 when a change happens. 
To enable this transparency, an API is provided for user interfaces to enable an administrator of 
grid computing environment 100 to access each of the grid managers 152, 154, 156, 160, 162, 
and 164 individually or all together. The API is static in that it includes only one method, i.e., a 
method that takes a string containing a command typed by the administrator. The API is dynamic 
because the string can contain many different commands. 

In some cases, the grid managers are developed using the Java programming language. 
In these cases, new commands issued to the grid managers can be supported by loading new or 
revised Java classes dynamically via classloaders. This dynamic access to code can be done 
without shutting down grid managers in the grid computing environment 100. Using Java 
classloaders, each time an instance of a class for a grid manager is generated, the definition and 
behavior of the class can be updated to provide new functionality to the grid computing 
environment 100. 

Another way to modify the functionality of the grid computing environment 100 
dynamically without shutting down the grid computing environment 100 is to change the 
hierarchical relations between grid managers, remove grid managers, or add new grid managers. 
The API provided for administration of the grid computing environment 100 is also configured 
to send strings to individual grid managers with commands to delete existing relations or add 
new relations. 
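
A single-method administration API of this kind can be sketched as follows. The command syntax (`add`, `delete`, `list`, followed by `superior` or `inferior` and an address) is invented for the example, as are the `AdminApi` class and the addresses; the text only states that commands are passed as strings and that they can add or delete relations.

```python
class AdminApi:
    """Hypothetical administration API of one grid manager."""

    def __init__(self):
        self.relations = {"superior": set(), "inferior": set()}

    def execute(self, command):
        """The only public method: parse and run one command string."""
        parts = command.split()
        if not parts:
            return "error: empty command"
        verb, args = parts[0], parts[1:]
        if verb in ("add", "delete"):
            if len(args) != 2 or args[0] not in self.relations:
                return f"usage: {verb} superior|inferior <address>"
            kind, address = args
            if verb == "add":
                self.relations[kind].add(address)
            else:
                self.relations[kind].discard(address)
            return "ok"
        if verb == "list":
            if len(args) != 1 or args[0] not in self.relations:
                return "usage: list superior|inferior"
            return ",".join(sorted(self.relations[args[0]]))
        return f"error: unknown command {verb}"


api = AdminApi()
print(api.execute("add inferior host18:7000"))     # ok
print(api.execute("list inferior"))                # host18:7000
print(api.execute("delete inferior host18:7000"))  # ok
print(api.execute("restart"))                      # error: unknown command restart
```

Because the method signature never changes, user interfaces built on it keep working when new commands are introduced; only the set of strings the grid manager understands grows.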

For administrators of grid computing environment 100, it is useful to visualize the 
applications and a grid manager on one computer device in the grid computing environment 100 
as well as other computer devices running part of the grid management hierarchy in the form of 
grid managers with one or more levels of inferior relations to the grid manager. The view of 
these computer devices is referred to as a grid landscape. As shown in FIG. 6, a grid graphical 
user interface (GUI) 500 for visualization of a grid landscape, such as the grid computing 
environment 100, includes GUI-elements visualizing an organization of services running on 
computer devices. The GUI 500 provides a grid-like structure with columns and rows. Rows 
represent services, which in turn are structured hierarchically with respect to the application 
to which a service belongs, the type of the service, and the specific service instances. Each 
service instance row is associated with a place in the grid computing environment 100 
representing where it is instantiated. In this context, columns represent the computer devices in 
the grid landscape. Specifically, GUI 500 has three columns representing three computer 
devices 12, 14, and 16. GUI 500 shows that grid manager 152 runs on computer device 12 with 
inferior grid managers 154 and 156 running on computer devices 14 and 16, respectively. GUI 
500 also shows internet pricing configurator services running on computer device 12. These 
internet pricing configurator services include IPC dispatcher 116, IPC server 120, and IPC 
manager 118. 

The GUI 500 is dynamically refreshed with feedback from the grid managers and internet 
pricing configurator (or other application) services so that new services appear in GUI 500 to an 
administrator. Similarly, services that are shut down are removed in GUI 500. 

As shown in FIG. 7, a grid browser component 600 is a composite graphical user 
interface (GUI) for browsing grid managers on computer devices in the grid computing 
environment 100. The component 600 displays a graph with curved edges and vertices. Vertices 
represent computer devices in the grid computing environment 100 and curved edges represent 
the directional association of grid managers on two computer devices (vertices) in the grid 
computing environment 100. This association is hierarchical (i.e., superior/inferior). Each 
vertex displays the network address of a computer device as well as applications currently 
running on the computer device. For example, component 600 shows computer devices 12, 14, 
16, 18, 20, and 22 with IPC servers 118, 120, 122, and 124. In other examples (not shown), the 
grid browser component 600 shows non-hierarchical, peer to peer associations of grid managers 
with non-directional edges representing the associations. 

The grid browser component 600 is context sensitive. Depending on the relationship 
among the grid managers on the computer devices (e.g., superior/inferior), computer devices are 
traversed with respect to a user's browsing history. 

By clicking on a vertex representing a computer device in GUI 600 (e.g., computer 
device 14), a user can automatically view a grid manager and applications running on the 
computer device and grid managers having inferior relations to the grid manager using GUI 500. 
The user can pick a computer device and see relations between its grid manager and other grid 
managers. This connection between GUIs 500 and 600 is done using software that generates 
GUIs 500 and 600. 
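
The set of grid managers shown after selecting a vertex, i.e., the selected grid manager together with all grid managers one or more levels below it, can be computed with a breadth-first traversal of the inferior relations. The sketch below is illustrative only; the `landscape` function and the dictionary representation are hypothetical, and the relations listed are those stated in the text.

```python
from collections import deque


def landscape(inferiors, root):
    """Return {manager: depth} for root and all managers below it."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in inferiors.get(current, ()):
            if child not in depth:
                depth[child] = depth[current] + 1
                queue.append(child)
    return depth


# Inferior relations named in the text: 152 above 154 and 156, 154 above 160 and 162.
inferiors = {"152": ["154", "156"], "154": ["160", "162"]}

print(landscape(inferiors, "154"))  # {'154': 0, '160': 1, '162': 1}
print(landscape(inferiors, "152"))  # {'152': 0, '154': 1, '156': 1, '160': 2, '162': 2}
```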

The network 8 can be implemented in a variety of ways. The network 8 includes any 
kind and any combination of networks such as an Internet, a local area network (LAN) or other 
local network, a private network, a public network, a plain old telephone system (POTS), or 
other similar wired or wireless networks. Communications through the network 8 may be 
secured with a mechanism such as encryption, a security protocol, or other type of similar 
mechanism. Communications through the network 8 can include any kind and any combination 
of communication links such as modem links, Ethernet links, cables, point-to-point links, 
infrared connections, fiber optic links, wireless links, cellular links, Bluetooth®, satellite links, 
and other similar links. 

The network 8 is simplified for ease of explanation. The network 8 can include more or 
fewer additional elements such as networks, communication links, proxy servers, firewalls or 
other security mechanisms, Internet Service Providers (ISPs), gatekeepers, gateways, switches, 
routers, hubs, client terminals, and other elements. 

Computer devices 12, 14, 16, 18, 20, and 22 communicate over medium 10 using one of 
many different networking protocols. For instance, one protocol is Transmission Control 
Protocol/Internet Protocol (TCP/IP) combined with SOAP (Simple Object Access Protocol). 

Embodiments of the invention can be implemented in digital electronic circuitry, or in 
computer hardware, firmware, software, or in combinations of them. Embodiments of the 
invention can be implemented as a computer program product, i.e., a computer program tangibly 
embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated 
signal, for execution by, or to control the operation of, data processing apparatus, e.g., a 
programmable processor, a computer, or multiple computers. A computer program can be 
written in any form of programming language, including compiled or interpreted languages, and 
it can be deployed in any form, including as a stand-alone program or as a module, component, 
subroutine, or other unit suitable for use in a computing environment. A computer program can 
be deployed to be executed on one computer or on multiple computers at one site or distributed 
across multiple sites and interconnected by a communication network. 

Method steps of embodiments of the invention can be performed by one or more 
programmable processors executing a computer program to perform functions of the invention 
by operating on input data and generating output. Method steps can also be performed by, and 
apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA 
(field programmable gate array) or an ASIC (application-specific integrated circuit). 

Processors suitable for the execution of a computer program include, by way of example, 
both general and special purpose microprocessors, and any one or more processors of any kind of 
digital computer. Generally, a processor will receive instructions and data from a read-only 
memory or a random access memory or both. The essential elements of a computer are a 
processor for executing instructions and one or more memory devices for storing instructions and 
data. Generally, a computer will also include, or be operatively coupled to receive data from or 
transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, 
magneto-optical disks, or optical disks. Information carriers suitable for embodying computer 
program instructions and data include all forms of non-volatile memory, including by way of 
example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; 
magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and 
CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or 
incorporated in special purpose logic circuitry. 

To provide for interaction with a user, embodiments of the invention can be implemented 
on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal 
display) monitor, for displaying information to the user and a keyboard and a pointing device, 
e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of 
devices can be used to provide for interaction with a user as well; for example, feedback 
provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory 
feedback, or tactile feedback; and input from the user can be received in any form, including 
acoustic, speech, or tactile input. 

Embodiments of the invention can be implemented in a computing system that includes a 
back-end component, e.g., as a data server, or that includes a middleware component, e.g., an 
application server, or that includes a front-end component, e.g., a client computer having a 
graphical user interface or a Web browser through which a user can interact with an 
implementation of embodiments of the invention, or any combination of such back-end, 
middleware, or front-end components. The components of the system can be interconnected by 
any form or medium of digital data communication, e.g., a communication network. Examples 
of communication networks include a local area network ("LAN") and a wide area network 
("WAN"), e.g., the Internet. 

The computing system can include clients and servers. A client and server are generally 
remote from each other and typically interact through a communication network. The 
relationship of client and server arises by virtue of computer programs running on the respective 
computers and having a client-server relationship to each other. 

A number of embodiments of the invention have been described. Nevertheless, it will be 
understood that various modifications may be made without departing from the spirit and scope 
of the invention. Other embodiments are within the scope of the following claims. 

WHAT IS CLAIMED IS: 

1. A method comprising: 
    in a client server network, maintaining systems having grid managers having 
    hierarchical relations, the relations of each grid manager stored in each of the systems. 

2. The method of claim 1 in which each of the relations are classified as superior or 
inferior. 

3. A system comprising: 
    a network of computer systems, each of the computer systems including a 
    grid management engine, each of the grid managers having hierarchical relations with 
    other grid managers, the relations of each grid manager stored in each of the systems. 

4. The method of claim 3 in which each of the relations are classified as superior or 
inferior. 

5. A method comprising: 
    in a network, starting an execution of a first service on a first computer, the first 
    service handling at least locating, reserving, allocating, monitoring, and deallocating one 
    or more computational resources for one or more applications using the network; 
    reading, by the first service, a file to inform the first service of a relation with a 
    second service, wherein the first service has an inferior relation with the second service, 
    the inferior relation meaning that the second service can send a query for available 
    computer resources to the first service; 
    establishing a first communication channel from the first service to the second 
    service; and 
    accepting an opening of a second communication channel from the second service 
    to the first service. 

6. The method of claim 5 further comprising: 
    receiving a message to cancel the first service's inferior relation with the second 
    service; 
    closing the first and second communication channels; 
    receiving a message to generate an inferior relation from the first service to a third 
    service residing in a third computer; 
    establishing a third communication channel from the second service to the third 
    service; and 
    accepting an opening of a fourth communication channel from the third service to 
    the first service. 

7. The method of claim 5 wherein establishing a first communication channel further 
comprises determining if the second service responds to determining and if not, 
establishing a communication channel to the second service after a predetermined time 
period. 

8. A method comprising: 
    in a network, starting an execution of a first service residing in a first computer, 
    the first service handling at least locating, allocating, monitoring, and deallocating one or 
    more computational resources for one or more applications using the network; 
    starting an execution of a second service residing in a second computer; 
    reading, by the second service, a file to inform the second service of a relation 
    with the first service, wherein the second service has an inferior relation with the first 
    service, wherein the inferior relation indicates that the first service can send a query for 
    available computer resources to the second service; 
    establishing a first communication channel from the second service to the first 
    service; and 
    establishing a second communication channel from the first service to the second 
    service. 

9. The method of claim 8 further comprising: 
    receiving, by the second service, a message to cancel the second service's relation 
    with the first service; 
    closing the first communication channel; 
    failing to respond to the second communication channel; 
    receiving a message to create an inferior relation from the second service to a third 
    service; 
    establishing a third communication channel from the second service to the third 
    service; and 
    establishing a fourth communication channel from the second service to the third 
    service. 

1 10. A system comprising: 

2 two or more computers each configured to run a service, the service handling at 

3 least locating, allocating, monitoring, and deallocating one or more computational 

4 resources for one or more applications; 

5 a network of the services, the network configured such that a first service from the 

6 services has a superiorrelation with a second service from the services and the second 

7 service has an inferior relation with the first service, wherein the first service is 

8 configured to check the status of the second service in the network by waiting for a 

9 response to a query from the first service to the second service. 

1 11. The system of claim 10 wherein the relation comprises a first communication channel 

2 ftt>m the first service to the second service and a second communication channel fix>m thr 

3 second service to the first service. 

1 12. The system of claim 10 wherein the first service is further configured to locate the 

2 one or more computational resources for the one or more applications by sending a query 

3 for available computational resources to the second service. 

1 13. The system of claim 10 wherein the second service is further configured to remove its 

2 inferior relation with the first service and create a new superior relation with a third service. 

ABSTRACT 

A method includes, in a grid computing environment, maintaining systems having 
grid managers having hierarchical relations, the relations of each grid manager stored in each 
of the systems. Each of these hierarchical relations is classified as superior or inferior. 

Abstract 

At least three factors in the existing migration frameworks 
make them less suitable in Grid systems, especially when the 
goal is to improve the response times for individual 
applications. These factors are the separate policies for 
suspension and migration of executing applications employed 
by these migration frameworks, the use of pre-defined 
conditions for suspension and migration, and the lack of 
knowledge of the remaining execution time of the 
applications. In this paper we describe a migration 
framework for performance oriented Grid systems that 
implements tightly coupled policies for both suspension and 
migration of executing applications and takes into account 
both system load and application characteristics. The main 
goal of our migration framework is to improve the response 
times for individual applications. We also present some 
results that demonstrate the usefulness of our migration 
framework. 



1. Introduction 

Computational Grids [8] involve large system dynamics such 
that the ability to migrate executing applications onto 
different sets of resources assumes great importance. 
Specifically, the main motivations for migrating 
applications in Grid systems are to provide fault tolerance 
and to adapt to load changes on the systems. 

In this paper, we focus on migration of applications 
executing on distributed and Grid systems when the loads on 
the system resources change. There are at least two 
disadvantages in using the existing migration frameworks 
[11, 16, 19, 9, 11] for improving the response times of 
executing applications. First, due to the separate policies 
employed by these migration frameworks for suspension of 
executing applications and migration of the applications to 
different systems, the applications can incur lengthy 
waiting times between when they are suspended and when they 
are restarted on new systems. Secondly, due to the use of 
pre-defined conditions for suspension and migration and due 
to the lack of knowledge of the remaining execution time of 
the applications, the applications can be suspended and 
migrated even when they are about to finish execution in a 
short period of time. 

*This work is supported in part by the National Science 
Foundation contract GRANT #EIA-9975020, SC #R36505-29200099 
and GRANT #EIA-9975015. 

In this paper, we describe a framework that defines and 
implements scheduling policies for migrating applications 
executing on distributed and Grid systems in response to 
system load changes. In our framework, the migration of 
applications depends on 

1. the amount of increase or decrease in loads on the re- 
sources, 

2. the time of the application execution when load is in- 
troduced into the system, 

3. the performance benefits that can be obtained for the 
application due to migration. 

Our migration framework is primarily intended for 
rescheduling long running applications that typically 
execute for several minutes. The migration of applications 
in our migration framework is dependent on the ability to 
predict the remaining execution times of the applications, 
which in turn is dependent on the presence of execution 
models that predict the total execution cost of the 
applications. The framework has been implemented and tested 
in the GrADS system [2]. Our test results indicate that our 
migration framework can help improve the performance of 
executing applications by more than 30%. 

In Section 2, we describe the GrADS system and the life 
cycle of GrADS applications. In Section 3, we introduce 
our migration framework by describing the different 
components for migration. In Section 4, we describe our 
experiments and provide various results. In Section 5, we 
present related work in the field of migration. We give 
concluding remarks and explain our future plans in Section 6. 

Figure 2. Interactions in Migration framework 



Launcher. The Application Launcher spawns the job on the 
[...] machines using a Globus job [...] mechanism and also 
spawns a component called the Contract Monitor. The Contract 
Monitor, through an Autopilot mechanism [13], monitors the 
times taken for [...] of the applications. The GrADS 
architecture also has a GrADS Information Repository (GIR) 
that maintains the different states of the application 
manager and the states of the numerical application. After 
spawning the numerical application through the Application 
Launcher, the application manager waits for the job to 
complete. The job can either complete or suspend its 
execution due to external intervention. These application 
states are passed to the application manager through the 
GIR. If the job has completed, the application manager 
exits, passing success values to the user. 

If the application is stopped, the application manager waits 
for a resume signal and then collects new machine 
information by starting from the resource selection phase 
again. 

3. The Migration Framework 

The ability to migrate applications in the GrADS system 
is implemented by adding a component called the rescheduler 
to the GrADS architecture. The migrator, the contract 
monitor that monitors the application's progress and the 
rescheduler that decides when to migrate together form the 
core of the migration framework. The interactions between 
the different components involved in the migration framework 
are illustrated in Figure 2. These components are described 
in detail in the following subsections. 

3.1. The Migrator 

We have implemented a user-level checkpointing library 
called SRS (Stop Restart Software). By making calls to SRS, 
the application gains the ability to checkpoint data, to be 
stopped at a particular point in execution, and to be 
restarted and continued later on a different configuration 
of processors. The SRS library is implemented on top of MPI 
at the application layer, and migration is achieved by a 
clean exit of the entire application followed by restarting 
the application over a new configuration of machines. The 
application interfaces for SRS look similar to CUMULVS [10], 
but unlike CUMULVS, SRS does not require a PVM virtual 
machine to be set up on the hosts. Although the method of 
rescheduling in SRS, by stopping and restarting executing 
applications, incurs more overhead than process migration 
techniques [4, 5, 15], the approach followed by SRS allows 
reconfiguration of executing applications and is portable 
across different MPI implementations. 

The SRS library consists of 6 main functions: 
SRS_Init(), SRS_Finish(), SRS_Restart_Value(), 
SRS_Check_Stop(), SRS_Register() and SRS_Read(). The 
user calls SRS_Init() and SRS_Finish() in his application 
after MPI_Init() and before MPI_Finalize() respectively. 
In order to know if the application is executed in the start 
or restart mode, the user calls SRS_Restart_Value(), which 
returns 0 and 1 in start and restart modes respectively. The 
user also calls SRS_Check_Stop() at different phases of the 
application to check if an external component wants the 
application to be stopped. 

The SRS library uses the Internet Backplane Protocol (IBP) 
[12] for storage of the checkpoint data. IBP depots are 
started on all the machines of the GrADS testbed. The user 
calls SRS_Register() in his application to register the 
variables that will be checkpointed by the SRS library. When 
an external component stops the application, the SRS library 
checkpoints only those variables that were registered 
through SRS_Register(). The user reads in the checkpointed 
data in the restart mode using SRS_Read(). The user, 
through SRS_Read(), also specifies the previous and current 
data distributions. By knowing the number of processors and 
the data distributions used in the previous and current 
execution of the application, the SRS library automatically 
performs the appropriate data redistribution. Thus, for 
example, the user can start his application on 4 processors 
with block distribution of data, stop the application and 
restart it on 8 processors with block-cyclic distribution. 
The details of the SRS API for accomplishing the automatic 
redistribution of data are beyond the scope of the current 
discussion. 

An external component (e.g., the rescheduler) wanting to 
stop an executing application interacts with a daemon called 
Runtime Support System (RSS). RSS exists for the entire 
duration of the application and spans across multiple 
migrations of the application. Before the actual parallel 
application is started, the RSS is launched by the 
application launcher on the machine where the user invokes 
the GrADS application manager. The actual application, 
through the SRS library, interacts with RSS to perform some 
initialization, to check if the application needs to be 
stopped during SRS_Check_Stop(), and to store and retrieve 
pointers to the checkpointed data. 
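
The start/restart control flow described in this subsection can be sketched as follows. This is a minimal illustration, not the SRS API: SRS is a C library layered on MPI, whereas MockSRS, MockCheckpointStore and run are hypothetical Python names, the checkpoint store is an in-memory dictionary standing in for IBP depots, and data redistribution is omitted.

```python
class MockCheckpointStore:
    """Hypothetical in-memory stand-in for checkpoint storage."""

    def __init__(self):
        self.data = {}               # checkpointed variables, by name
        self.stop_requested = False  # set by an external component


class MockSRS:
    """Hypothetical mock of the checkpoint/restart pattern (not real SRS)."""

    def __init__(self, store):
        self.store = store
        self.registered = {}

    def restart_value(self):
        # 0 in start mode, 1 in restart mode, mirroring the text.
        return 1 if self.store.data else 0

    def register(self, name, getter):
        # Only registered variables are checkpointed.
        self.registered[name] = getter

    def check_stop(self):
        # On a stop request, checkpoint registered variables and report it.
        if self.store.stop_requested:
            self.store.data = {n: g() for n, g in self.registered.items()}
            self.store.stop_requested = False
            return True
        return False

    def read(self, name):
        return self.store.data[name]


def run(store, total_iters, stop_at=None):
    """Iterative toy application; returns None if stopped, else the result."""
    srs = MockSRS(store)
    state = {"i": 0, "acc": 0}
    if srs.restart_value() == 1:
        # Restart mode: resume from the checkpointed values.
        state["i"] = srs.read("i")
        state["acc"] = srs.read("acc")
    srs.register("i", lambda: state["i"])
    srs.register("acc", lambda: state["acc"])
    while state["i"] < total_iters:
        if stop_at is not None and state["i"] == stop_at:
            store.stop_requested = True  # simulate the external component
        if srs.check_stop():
            return None                  # clean exit; restart happens later
        state["acc"] += state["i"]
        state["i"] += 1
    return state["acc"]
```

Calling run once with a stop point and then again on the same store resumes from the checkpoint rather than from iteration 0, which is the behavior the text attributes to the stop-and-restart approach.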

3.2. Contract Monitor 

Contract Monitor is a component that uses the Autopilot 
infrastructure [13] to monitor the progress of the 
applications in GrADS. An autopilot manager is started 
before the launch of the numerical application. The 
numerical application is instrumented with calls to send the 
execution times taken for the different phases of the 
application to the contract monitor. The contract monitor 
compares the actual execution times with the predicted 
execution times and calculates the ratio between them. The 
tolerance limits of the ratio are specified as inputs to the 
contract monitor. 

When a given ratio is greater than the upper tolerance 
limit, the contract monitor calculates the average of the 
computed ratios. If the average is greater than the upper 
tolerance limit, it contacts the rescheduler, requesting 
migration of the application. The contract monitor uses the 
average of the ratios, rather than a single ratio, to decide 
when to contact the rescheduler for the following reasons: 

1. A competing application of short duration on one of 
the machines may have increased the load on the machine 
and hence caused the loss in performance of the 
application. Contacting the rescheduler for migration on 
noticing a few losses in performance would result in 
unnecessary migration in this case, since the competing 
application will end soon and the application's 
performance will be back to normal. 

2. The average of the ratios also captures the history of 
the behavior of the machines on which the application 
is running. 

3. The average of the ratios also takes into account the 
percentage completed time of the application's execution. 

If the rescheduler refuses to migrate the application, the 
contract monitor adjusts its tolerance limits to new values. 
Similarly, when a given ratio is less than the lower 
tolerance limit, the contract monitor calculates the average 
of the ratios and adjusts the tolerance limits if the 
average is less than the lower tolerance limit. The dynamic 
adjusting of tolerance limits not only reduces the amount of 
communication between the contract monitor and the 
rescheduler but also hides the deficiencies in the 
application-specific execution time model. 
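
The decision logic of the contract monitor can be sketched as below. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, and because the text does not say what the "new values" of the tolerance limits are, the sketch assumes a limit is simply moved to the current average ratio.

```python
class ContractMonitorSketch:
    """Hypothetical sketch of the ratio/average logic described above."""

    def __init__(self, lower, upper, request_migration):
        self.lower = lower    # lower tolerance limit on actual/predicted
        self.upper = upper    # upper tolerance limit on actual/predicted
        # Callback standing in for the rescheduler; returns True if it
        # agrees to migrate the application, False if it refuses.
        self.request_migration = request_migration
        self.ratios = []

    def report_phase(self, actual_time, predicted_time):
        ratio = actual_time / predicted_time
        self.ratios.append(ratio)
        average = sum(self.ratios) / len(self.ratios)

        if ratio > self.upper and average > self.upper:
            if self.request_migration():
                return "migrate"
            # Rescheduler refused. ASSUMPTION: move the upper limit to
            # the average so the same condition is not reported again.
            self.upper = average
            return "upper_adjusted"

        if ratio < self.lower and average < self.lower:
            # ASSUMPTION: move the lower limit to the average.
            self.lower = average
            return "lower_adjusted"

        return "ok"
```

With this structure a single slow phase after several on-target phases does not contact the rescheduler, because the average stays below the upper limit, which matches the first reason given in the list above.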
3.3. Rescheduler 

Rescheduler is the component that evaluates the performance 
benefits that can be obtained due to the migration of an 
application and initiates the migration of the application. 
It operates in two modes: migration on request and 
opportunistic migration. When the contract monitor detects 
intolerable performance loss for an application, it contacts 
the rescheduler requesting it to migrate the application. 
This is called migration on request. In other cases, if a 
GrADS application was recently completed, the rescheduler 
determines if performance benefits can be obtained for an 
executing application by migrating it to use the resources 
that were freed by the completed application. This is called 
opportunistic rescheduling. 

In both cases, the rescheduler first contacts the Network 
Weather Service (NWS) to get the updated information for 
the machines in the Grid. It then contacts the 
application-specific performance modeler to evolve a new 
schedule for the application. Based on the total percentage 
completion time for the application and the total predicted 
execution time for the application with the new schedule, 
the rescheduler calculates the remaining execution time, 
ret_new, of the application if it were to execute on the 
machines in the new schedule. The rescheduler also 
calculates ret_current, the remaining execution time of the 
numerical application if it were to continue executing on 
the original set of machines. The rescheduler then 
calculates the rescheduling gain as 

\[
\mathit{rescheduling\_gain} = \frac{\mathit{ret\_current} - (\mathit{ret\_new} + 900)}{\mathit{ret\_current}}
\]

The number 900 in the numerator of the fraction is the 
worst-case time in seconds needed to reschedule the 
application. The various times involved in rescheduling are 
given in Table 1. The times shown in Table 1 were obtained 
by conducting a number of experiments with different problem 
sizes and taking the maximum time for each phase of 
rescheduling. Thus the rescheduling strategy adopts a 
pessimistic approach, in which migration of applications 
will be avoided in certain cases where migration could yield 
performance benefits. 

If the rescheduling gain is greater than 30%, the 
rescheduler sends a STOP signal to the application and 
stores the stop status in the GIR. The application manager 
then waits for the RESUME signal. The rescheduler stores the 
RESUME value in the GIR, thus prompting the application 
manager to evolve a new schedule and restart the application 
on the new schedule. If the rescheduling gain is less than 
30% and the rescheduler is operating in the migration on 
request mode, the rescheduler contacts the contract monitor, 
prompting the contract monitor to adjust its tolerance 
limits. 
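
A small worked sketch of the gain computation and the 30% decision follows. The function names are hypothetical, and the remaining_time helper assumes that remaining time is the predicted total time scaled by the fraction of work not yet completed, which is one plausible reading of the description and is not confirmed by the text.

```python
WORST_CASE_RESCHEDULING_COST = 900.0  # seconds, total from Table 1
GAIN_THRESHOLD = 0.30                 # rescheduling threshold from the text


def remaining_time(predicted_total, fraction_completed):
    # ASSUMPTION: remaining time scales linearly with the work left.
    return predicted_total * (1.0 - fraction_completed)


def rescheduling_gain(ret_current, ret_new,
                      cost=WORST_CASE_RESCHEDULING_COST):
    # Fraction of the current remaining time saved by migrating,
    # after charging the rescheduling cost to the new schedule.
    return (ret_current - (ret_new + cost)) / ret_current


def should_migrate(ret_current, ret_new,
                   cost=WORST_CASE_RESCHEDULING_COST,
                   threshold=GAIN_THRESHOLD):
    return rescheduling_gain(ret_current, ret_new, cost) > threshold


# Hypothetical numbers: 6000 s left on the loaded machines versus
# 2000 s on the new schedule gives a gain of about 0.52, so migrate.
# With only 3000 s left, the gain is about 0.03, so continue in place.
```

The second case shows why an application that is close to finishing is left alone: the fixed rescheduling cost consumes nearly all of the potential saving.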

The rescheduling threshold [17], which the performance 
gain due to rescheduling must cross for rescheduling to 
yield significant performance benefits, depends on the load 
dynamics of the system resources and the accuracy of the 
measurements of resource information, and may also depend on 
the particular application for which rescheduling is made. 
Since the measurements made by NWS are fairly accurate, the 
rescheduling threshold for our experiments depended only on 
the load dynamics of the system resources. By means of 
trial-and-error experiments we determined the rescheduling 
threshold for our testbed to be 30%. 

Rescheduling phase                                 Time (secs) 
W[riting checkpoints]                                       40 
Waiting for NWS to update information                       90 
Time for application manager to get new 
  resource information from NWS                            120 
Evolving new application-level schedule                     80 
Other grid overhead                                         10 
Starting application                                        60 
Reading checkpoints and data redistribution                500 
Total                                                      900 

Table 1. Times for rescheduling phases 

4. Experiments and Results 

The GrADS experimental testbed consists of about 40 
machines that reside in institutions across the United 
States, including University of Tennessee, University of 
Illinois, University of California at San Diego, Rice 
University etc. For the sake of clarity, our experimental 
testbed consists of two clusters, one in University of 
Tennessee and another in University of Illinois, 
Urbana-Champaign. The Tennessee cluster consists of eight 
933 MHz dual-processor Pentium III machines running Linux 
and connected to each other by 100 Mb switched Ethernet. The 
Illinois cluster consists of sixteen 450 MHz 
single-processor Pentium II machines running Linux and 
connected to each other by 1.28 Gbit/second full duplex 
Myrinet. The two clusters are connected by means of the 
Internet. 
Five applications, namely, ScaLAPACK LU and QR 
factorizations, ScaLAPACK eigenvalue problems, PETSC CG 
application and heat equation solver, have been integrated 
into the migration framework by instrumenting the 
applications with SRS calls and writing performance models 
for the applications. In general, our migration framework is 
suitable for iterative parallel applications for which 
performance models predicting the execution costs can be 
written. In our experiments, ScaLAPACK QR factorization 
was used as the end application. The data that were 
checkpointed by the SRS library for the application included 
the matrix, A, and the right-hand side vector, B. 

In all the experiments in this section, 4 Tennessee 
machines and 8 Illinois machines were used. A given matrix 
size for the QR factorization problem was input to the 
application manager. For large problem sizes, the 
computation time dominates the communication time for the 
ScaLAPACK application. Since the Tennessee machines 
have higher computing power than the Illinois machines, 
the application manager, by means of the performance 
modeler, chose the 4 Tennessee machines for the end 
application run. A few minutes after the start of the end 
application, artificial load is introduced into the 4 
Tennessee machines. This artificial load is achieved by 
executing a certain number of loading programs on each of 
the Tennessee machines. The loading program used was a 
sequential C code that consists of a single looping 
statement that loops forever. This program was compiled 
without any optimization in order to achieve the loading 
effect. 

Due to the loss in predicted performance caused by the 
artificial load, the contract monitor requested the resched- 
uler to migrate the application. The rescheduler evaluated 
the potential performance benefits that can be obtained by 
migrating the application to the 8 Illinois machines and ei- 
ther migrated the application or allowed the application to 
continue on the 4 Tennessee machines. The rescheduler was 
operated in two modes - a default and a non-default mode. 
The normal operation of the rescheduler is its default mode 
and the non-default mode of the rescheduler is when the 
rescheduler code was modified to force the application to ei- 
ther migrate or continue on the same set of resources. Thus 
in cases when the default mode of the rescheduler was to 
migrate the application, the non-default mode was to con- 
tinue the application on the same set of resources and in 
cases when the default mode of the rescheduler was to not 
migrate the application, the non-default mode was to force 
the rescheduler to migrate the application by adjusting the 
rescheduling cost parameters. For each experimental run, 
results were obtained for both when rescheduler was oper- 
ated in the default and non-default mode. This allowed us 
to compare both scenarios and to verify if the rescheduler 
made the right decision. 

Three parameters were involved in each set of experi- 
ments - the size of the matrices, the amount of load and the 
time after the start of the application when the load was in- 
troduced into the system. The following three sets of exper- 
iments were obtained by fixing two of the parameters and 
varying the other parameter. 

In the first set of experiments, the artificial load 
consisting of 10 loading programs was introduced into the 
system 5 minutes after the start of the end application. The 
bar chart in Figure 3 was obtained by varying the size of 
the matrices, i.e., the problem size, on the x-axis. The 
y-axis represents the execution time in seconds of the 
entire problem including the Grid overhead. For each problem 
size, the bar on the left represents the execution time when 
the application was not migrated and the bar on the right 
represents the execution time when the application was 
migrated. 

Figure 3. Problem Sizes and Migration (x-axis: size of 
matrices (N); y-axis: execution time in seconds) 

Several points can be observed from Figure 3. The time 
for reading checkpoints occupied most of the rescheduling 
cost since it involves moving data across the Internet 
from Tennessee to Illinois and redistribution of data from 
4 to 8 processors. On the other hand, the time for writing 
checkpoints is insignificant since the checkpoints are 
written to local disks. The rescheduling benefits are 
greater for large problem sizes since the remaining lifetime 
of the end application when load is introduced is larger for 
larger problem sizes. There is a particular size of the 
problem below which the migrating cost overshadows the 
performance benefit due to rescheduling. Except for matrix 
size 8000, the rescheduler made the correct decision for all 
matrix sizes. For matrix size 8000, the rescheduler assumed 
a worst-case rescheduling cost of 900 seconds while the 
actual rescheduling cost was close to 420 seconds. 
Thus the rescheduler evaluated the performance benefit to 
be negligible while the actual scenario points to the 
contrary. Thus the pessimistic approach followed by using a 
worst-case rescheduling cost in the rescheduler will lead to 
underestimating the performance benefits due to rescheduling 
in some cases. 
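
The effect of the pessimistic cost estimate can be illustrated numerically. In the sketch below only the two costs (900 and 420 seconds) and the 30% threshold come from the text; the remaining execution times are hypothetical values chosen to show how the decision can flip, and are not measurements from the experiments.

```python
def rescheduling_gain(ret_current, ret_new, cost):
    # Same gain expression as in Section 3.3, with the cost as a parameter.
    return (ret_current - (ret_new + cost)) / ret_current


THRESHOLD = 0.30

# HYPOTHETICAL remaining execution times in seconds (not from the paper).
ret_current = 2000.0  # if the application stays on the loaded machines
ret_new = 800.0       # if the application moves to the new machines

gain_assumed = rescheduling_gain(ret_current, ret_new, 900.0)  # worst case
gain_actual = rescheduling_gain(ret_current, ret_new, 420.0)   # observed cost

decision_assumed = gain_assumed > THRESHOLD  # what the rescheduler decides
decision_actual = gain_actual > THRESHOLD    # what hindsight would decide
```

For these illustrative inputs the assumed-cost gain falls below the threshold while the actual-cost gain exceeds it, which is the kind of missed migration the paragraph above describes.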

In the second set of experiments, matrix size 12000 was 
chosen for the end application and artificial load was 
introduced [...] minutes into the execution of the 
application. In this set of experiments, the amount of 
artificial load was varied by varying the number of loading 
programs that were executed. In Figure 4, the x-axis 
represents the number of loading programs and the y-axis 
represents the execution time in seconds. For each amount of 
load, the bar on the left represents the case when the 
application was continued on 4 Tennessee machines and the 
bar on the right represents the case when the application 
was migrated to 8 Illinois machines. 

Figure 4. Load Amount and Migration 

In the third set of experiments, shown in Figure 5, an equal 
amount of load consisting of 7 loading programs was 
introduced at different points of execution of the end 
application for the same problem of matrix size 12000. The 
x-axis represents the elapsed time in minutes of the 
execution of the end application when the load was 
introduced. The y-axis represents the total execution time 
in seconds. Similar to the previous experiments, the bars on 
the left denote the cases when the application was not 
rescheduled and the bars on the right represent the cases 
when the application was rescheduled. From Figures 4 and 5, 
we observe that the performance benefits due to rescheduling 
increase with the amount of load and decrease as the load is 
introduced later into the program execution. 

4.2. Opportunistic Migration 

In this set of experiments, we illustrate opportunistic 
migration, in which the rescheduler tries to migrate an 
executing application when some other application completes. 

Figure 5. Load Introduction Time and Migration 

For these experiments, two problems were involved. For the 
first problem, matrix size of 14000 was input to the 
application manager and 6 Tennessee machines were made 
available. The application manager, through the performance 
modeler, chose the 6 machines for the end application run. 
Two minutes after the start of the end application for the 
first problem, a second problem of a given matrix size was 
input to the application manager. For the second problem, 
the 6 Tennessee machines on which the first problem was 
executing and 2 Illinois machines were made available. Due 
to the presence of the first problem, the 6 Tennessee 
machines alone were insufficient to accommodate the second 
problem. Hence the performance model chose the 6 Tennessee 
machines and 2 Illinois machines for the end application, 
and the actual application run involved communication across 
the Internet. 

In the middle of the execution of the second application, 
the first application completed and hence the second 
application can be potentially migrated to use only the 6 
Tennessee machines. Although this involved constricting the 
number of processors of the second application from 8 to 
6, there can be potential performance benefits due to the 
non-involvement of the Internet. The rescheduler evaluated 
the potential performance benefits due to migration and made 
an appropriate decision. 

Figure 6 shows the results for two illustrative cases when 
matrix sizes of the second application were 13000 and 
14000. The x-axis represents the matrix sizes and the y-axis 
represents the execution time in seconds. For each 
application run, three bars are shown. The bar on the left 
represents the execution time for the first application that 
was executed on 6 Tennessee machines. The middle bar 
represents the execution time of the second application when 
the entire application was executed on 6 Tennessee and 2 
Illinois machines. The bar on the right represents the 
execution time of the second application when the 
application was initially executed on 6 Tennessee and 2 
Illinois machines and later migrated to execute on only 6 
Tennessee machines when the first application completed. 

In both problem cases, matrix sizes 13000 and 14000 
for the second problem, the rescheduler made the correct 
decision of migrating the application. We also find that for 
both problem cases, the second application was almost 
immediately rescheduled after the completion of the first 
application. 

5. Related Work 

Different systems have been implemented to migrate 
executing applications onto different sets of resources. 
These systems migrate applications either to efficiently use 
under-utilized resources [14, 5, 4, 16, 6], to provide fault 
resilience [1] or to reduce the obtrusiveness to workstation 
owner [1, 11]. The particular projects that are closely 
related to our work are Dynamite [16], MARS [9], LSF [19] 
and Condor [11]. 

The Dynamite system [16], based on Dynamic PVM [6], 
migrates applications when certain machines are 
under-utilized or over-utilized as defined by 
application-specified thresholds. Although this method takes 
into account application-specific characteristics, it does 
not necessarily evaluate the remaining execution time of the 
application and the resulting performance benefits due to 
migration. MARS [9] migrates applications taking into 
account both the system loads and application 
characteristics. But the migration decisions are made only 
at different phases of the applications, unlike our 
migration framework where the applications are continuously 
monitored and migration decisions are made whenever the 
applications are not making sufficient progress. 

In LSF [19], jobs can be submitted to queues which have 
pre-defined migration thresholds. A job can be suspended 
when the load of the resource increases beyond a particular 
limit and can be migrated when the time since the suspen- 
sion becomes higher than the migration threshold for the 
queue. Thus LSF suspends jobs to maintain the load level 
of the resources while our migration framework suspends 
jobs only when it is able to find better resources where 
the jobs can be migrated. By adopting a strict approach 
to suspending jobs based on pre-defined system limits, LSF 
gives less priority to the stage of the application execution, 
whereas our migration framework suspends an application 
only when the application has large enough remaining ex- 
ecution time so that performance benefits can be obtained 
due to migration. And lastly, due to the separation of the 
suspension and migration decisions, a suspended applica- 
tion in LSF can wait for a long time before it restarts exe- 
cuting on a suitable resource. In our migration framework, 
a suspended application is immediately restarted due to the 
tight coupling of suspension and migration decisions. 

Of the Grid computing systems, only Condor [11] seems 
to migrate applications under workload changes. Condor 
provides a powerful and flexible ClassAd mechanism by 
means of which the administrator of resources can define 
policies for allowing jobs to execute on the resources, 
suspending the jobs and vacating the jobs from the 
resources. The fundamental philosophy of Condor is to 
increase the throughput of long running jobs and also 
respect the ownership of the resource administrators. The 
main goal of our migration framework is to improve the 
response times of individual applications. Similar to LSF, 
Condor also separates the suspension and migration decisions 
and hence has the same problems mentioned for LSF in taking 
into account the performance benefits of migrating the 
applications. Unlike our metascheduler framework, the Condor 
system does not possess the knowledge about the remaining 
execution time of the applications. Thus suspension and 
migration decisions can be invoked frequently in Condor 
based on system load changes. This may be less desirable in 
Grid systems where system load dynamics are fairly high. 

6. Conclusions and Future Work 

Many existing migration frameworks that migrate 
applications under loading conditions implement simple 
policies that cannot be applied to Grid systems. We have 
implemented a migration framework that takes into account 
both the system load and application characteristics. 
Experiments were conducted and results were presented to 
demonstrate the capabilities of the migration framework. 

Of the various costs involved in rescheduling, the cost 
for data redistribution is the only significant cost that 
depends on the number and amount of checkpointed data, the 
data distributions used for the data and the current and 
future processor sets for the application. We are planning 
to modify the SRS library and the interactions in the 
migration framework so that the redistribution cost can be 
dynamically calculated. Also, instead of fixing the 
rescheduler threshold at 30%, our future work will involve 
determining the rescheduling threshold dynamically based on 
the dynamic observation of load behavior on the system 
resources. Lastly, we propose to investigate the usefulness 
of our approach for complex applications involving multiple 
components and/or written in multiple programming languages. 

ABSTRACT 

Emerging national-scale "Computational Grid" infrastructures 
are deploying advanced services beyond those taken for 
granted in today's Internet: for example, authentication, 
remote access to computers, resource management, and 
directory services. The availability of these services 
represents both an opportunity and a challenge for the 
application developer: an opportunity because they enable 
access to remote resources in new ways, a challenge because 
these services may not be compatible with the commodity 
distributed-computing technologies used for application 
development. The Commodity Grid project is working to 
overcome this difficulty by creating what we call Commodity 
Grid Toolkits (CoG Kits) that define mappings and interfaces 
between Grid and particular commodity frameworks. 
In this paper, we explain why CoG Kits are important, 
describe the design and implementation of a Java CoG Kit, 
and use examples to illustrate how CoG Kits can enable 
new approaches to application development based on the 
integrated use of commodity and Grid technologies. 

Categories and Subject Descriptors 

D.1.3 [Software Engineering]: Concurrent Programming -
Distributed programming; D.2.6 [Software Engineering]:
Programming Environments - Graphical environments;
D.2.6 [Software Engineering]: Programming Environments -
Programmer workbench

1. INTRODUCTION 

The explosive growth of the Internet and of distributed
computing in general has led to rapid technology development
in several domains. In the world of commodity computing, a
broad spectrum of distributed computing technologies (Web
protocols [16], Java [14], JINI [1], CORBA [4], DCOM [2%,
etc.) has emerged with revolutionary effects on how we
access and process information. Simultaneously, the
high-performance computing community has taken big steps
toward the creation of so-called Grids [7], advanced
infrastructures designed to enable the coordinated use of
distributed high-end resources for scientific problem solving.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies
are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
Java 2000 San Francisco CA USA
Copyright ACM 2000 1-58113-288-3/00/6...$5.00

These two worlds of what we will call "commodity" and
"Grid" computing have evolved in parallel, with different
goals leading to different emphases and technology solutions.
For example, commodity technologies tend to focus on issues
of scalability, component composition, and desktop
presentation, while Grid developers emphasize end-to-end
performance, advanced network services, and support for unique
resources such as supercomputers. The results of this parallel
evolution are multiple technology sets with some overlaps,
much complementarity, and some obvious gaps.

In this context, we believe that it is timely to investigate
how the worlds of commodity and Grid computing can be
combined. Hence, we have established the Commodity Grid
(CoG) project, with the twin goals of (a) enabling developers
of Grid applications to exploit commodity technologies
wherever possible and (b) exporting Grid technologies to
commodity computing (or, equivalently, identifying
modifications or extensions to commodity technologies that can
render them more useful for Grid applications).

A first activity being undertaken within the CoG project
is the design and development of a set of Commodity Grid
Toolkits (CoG Kits), which we define as follows:

Definition: A Commodity Grid Toolkit (CoG Kit) defines
and implements a set of general components that
map Grid functionality into a commodity
environment/framework.

Hence, we can imagine a Web/CGI CoG Kit, a Java CoG
Kit, a CORBA CoG Kit, a DCOM CoG Kit, and so on.
In each case, the benefit of the CoG Kit is that it enables
application developers to exploit advanced Grid services
(resource management, security, resource discovery) while
developing higher-level components in terms of the familiar
and powerful application development frameworks provided
by commodity technologies. In each case, we also face the
challenge of developing appropriate interfaces between Grid
and commodity concepts and technologies and, if
[Figure 1 labels: Protocols, Authentication, Policy,
Instrumentation, Resource Management, Discovery, Events,
etc.; Storage, Networks, Computers, Display Devices, etc.,
and their associated local services]

Figure 1: The integrated Grid architecture has four main
categories.

Figure 2: Multiple portals provide access to overlapping
functionality, with a particular portal specialized to the
requirements of its user.



ble by scientists, allowing for feedback to check for model
accuracy. Access points to the portal include computer
terminals in electronically enhanced farm buildings and also
specialized input and output devices that allow for the
installation in, for example, a lightweight wireless device to
store a useful subset of the information in the field. The
farmer's portal also provides access to other services and
information sources, for example, financial market monitoring
services that observe the fluctuation of the value of the crops
and give advice that may result in greater profits (Figure 2).

3^ Science Portal Requirements

The creation of science portals such as those just described
requires the integration of many technologies from different
fields. We will typically provide access to a wide variety
of data; hence, we must be able to access and communicate
with a wide range of information sources. The complex
calculations performed on this data require the ability to
access compute resources with significant computational
resources. We may also require access to proprietary software
loaded on remote machines. Thus, the ability to incorporate
remote computational resources is required. Interactive use
can require that computational and data resources be
accessed via high-performance networks; we would also like to
be able to enforce performance guarantees for data transfers
and computations.



The success of a science portal is also measured by its
usability and acceptance in the community. Hence, we require
environments that allow rapid prototyping of both complete
applications and new components that can be shared with
other users. The ability to rapidly create portable user
interfaces is particularly critical. These requirements overlap
strongly with two types of technology:

• Commodity technologies that emphasize ease of use
and code reuse in local (especially desktop) environments:
GUI components, component libraries, scripting
languages, industry-accepted distributed computing
frameworks, industrial-strength database servers,
object-oriented programming languages and frameworks,
and the like.

• Grid technologies that emphasize effective operation
in large-scale, multi-institutional, wide area environments:
access to remote computation, information
services, high-speed data transfers, special protocols
(e.g., multicast), and gateways to local authentication
These considerations lead to the question that has motivated
the research reported in this paper: How can commodity and
Grid technologies interface and integrate so as to achieve
interoperability and, ideally, to enhance the capabilities of
both? For example, we might decide to use CORBA for
application development, but also want to use Grid services
for scheduling and managing computations on a supercomputer.
Or, if we are using Java, then Jini might appear to
be a good mechanism for resource discovery; but then we
face the problem of accessing data stored in the extensive
(currently LDAP-based) Grid information service. The
interactions can be complex and require significant effort and
thought to get right. Yet the technology base that exists in
each case is sufficiently large and robust that exploiting these



Figure 4: Applications and more complex components
can be built with the help of the CoG Kit.
Components are classified here based on their role.



6. JAVA COG KIT IMPLEMENTATION

Figure 5 shows how our Java CoG Kit is used in practice.
This Java program skeleton forms part of a Climate Portal;
it demonstrates how simple it is to build portal-specific
services when accessing a variety of basic Grid services through
the Java CoG Kit. In this example, an appropriate machine
is selected for execution, data for a climate model is
located and downloaded to the machine, and the climate
model is executed on that machine. The program generates
an output file in GrADS [12] format, a well-known format
for storing three-dimensional climate-related data.
Throughout the remainder of the paper we will expand this
example as we introduce various Java CoG Kit



6.1 Low-Level Grid Mappings

In this section we enumerate a subset of packages that
provide the interface to the low-level Grid services and
application interfaces. These packages are used by many users
to develop Java-based programs in the Grid. We will
describe only the general functionality of these packages, as it
is beyond the scope of this paper to explain every class and
method. For a complete list of the classes and methods we
refer to the distribution [27].

RSL. The package org.globus.rsl provides methods for
creating, manipulating, and checking the validity of the RSL
expressions used in Globus [11] to express resource
requirements. As shown in Step 3 of Figure 5, the arguments to
a new call include parameters that specify both
characteristics of the required resources and properties of the
computation.
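
To make the idea of creating and checking such expressions concrete, the following is a minimal sketch, assuming expressions of the "(attribute=value)" form that appears in Step 3 of Figure 5. The class RslSketch and its methods are hypothetical names introduced for illustration; they are not the API of org.globus.rsl, and the validity check covers only balanced parentheses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of org.globus.rsl: builds "(attribute=value)" relations. */
final class RslSketch {
    private final List<String> relations = new ArrayList<>();

    /** Appends one (attribute=value) relation; repeated attributes are allowed. */
    RslSketch add(String attribute, String value) {
        if (attribute == null || attribute.isEmpty()) {
            throw new IllegalArgumentException("attribute must be non-empty");
        }
        relations.add("(" + attribute + "=" + value + ")");
        return this;
    }

    /** Concatenates the relations in insertion order. */
    String build() {
        return String.join("", relations);
    }

    /** Minimal validity check: parentheses must balance and never close early. */
    static boolean hasBalancedParentheses(String expression) {
        int depth = 0;
        for (char c : expression.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth < 0) {
                return false;
            }
        }
        return depth == 0;
    }
}
```

Under these assumptions, new RslSketch().add("executable", "climateModel").add("processors", "64").build() yields a string in the same shape as the one passed to the RSL constructor in Figure 5.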

GRAM. The package org.globus.gram provides a mapping
to the Globus GRAM services [10], which allow users to
schedule and manage remote computations. The classes and
methods distributed allow users to submit jobs, bind to
already submitted jobs, and cancel jobs on remote computers.
Other methods allow users to determine whether they
can submit jobs to a specific resource (through a Globus
gatekeeper) and to monitor the job status (pending, active,
failed, done, and suspended).
// Step 0. Initialization
MDS mds = new MDS("www.globus.org", 389, "o=Grid");

// Step 1. Search for an available machine
result = mds.search(
    "(objectclass=GridComputeResource)(freenodes=64)",
    "contact");

// Step 1.a) Select a machine with minimal execution time
//           from the contacts that are returned in result
//           (referred to below as machineContact)

// Step 2. Prepare the data for the experiment

// Step 2.a) Search for climate data and return
//           attributes: server, port, directory, file
dn = mds.search(
    "(objectclass=ClimateData)(year=1999)(region=midwest)",
    "dn", MDS.subtreeScope);
result = mds.lookup(dn, "server port directory file");

// Step 2.b) Download the data
url = result.get("server") + ":"
    + result.get("port") + "/"
    + result.get("directory") + "/"
    + result.get("file");
data = server.fetch(url, machineContact);

// Step 3. Prepare a description for running the model
RSL rsl = new RSL("(executable=climateModel)"
    + "(processors=64)"
    + "(arguments=-grads)(arguments=-out map.grads)"
    + "(arguments=-in " + data.filename + ")");

// Step 4. Submit the program
GramJob job = new GramJob();
job.addJobListener(new GramJobListener() {
    public void stateChanged(GramJob job) {
        // react to job state changes
    }
});
try {
    job.request(machineContact, rsl);
} catch (GramException e) {
    // handle the submission error
}

Figure 5: This sample script shows how to access basic
Grid services with the help of the Java CoG Kit. Here data
for a climate model is located, an appropriate machine is
selected, and the climate model is executed on that machine.

// Step 0: Initialize the table
MDSsearchTable table =
    new MDSsearchTable(mds);

// Step 1: Perform a search in the MDS
//         to request data to be displayed
table.search("(objectclass=GridComputeResource)",
             "hn gramversion contact");

// Step 2: Display and update the table
table.show();

// Step 3: Return the selection
String machineContacts =
    table.getSelection("contact");

Figure 8: The program shows the ease of use of the
Graphical User Interface for selecting a Grid contact
string (compare Figure 7).

sults of MDS search queries (Figures 7 and 8), trees that
display the directory information tree of the MDS, and
tables to display HBM and network performance data. Each
component can be customized and is available as a JavaBean.
In future releases of the Java CoG Kit it will be possible
to integrate the bean in a Java-based GUI composition tool
such as JBuilder or VisualCafe.

6.4 High-Level Graphical Application

High-level graphical applications combine a variety of CoG
Kit components to deliver a single application or applet.
Naturally, these applications can be combined in order to
provide even greater functionality. The user should select
the tools that seem appropriate for the task. To demonstrate
the range of applications, we have included a set of screen
dumps that highlight the look and feel of some applications
developed to date.

GECCO. The Graph Enabled Console Component
(GECCO) is a graphical tool for specifying and monitoring
the execution of sets of tasks with dependencies between
them [26][24]. Specifically it allows one to

1. specify the jobs and their dependencies graphically or
with the help of an XML-based configuration file;

2. debug the specification in order to find erroneous
specification strings before the job is submitted; and

3. execute and monitor the job graphically and with the
help of a log file.

As shown in Figure 9 each job is represented as a node in
the graph. A job is executed as soon as its predecessors
are reported as having successfully completed. The state
of a job is animated with colors. It is possible to modify
the specification of the job by clicking on the node: a
specification window pops up allowing the user to edit the
RSL, the label, and other parameters. Editing can also be
performed during runtime (job execution), hence providing
for simple computational steering.
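
The rule that a job runs as soon as its predecessors have completed can be sketched as follows. This is an illustrative sketch only, not GECCO code: the class DependencyGraphSketch, its method names, and the job names in the usage note are assumptions, and every started job is assumed to complete successfully.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of dependency-driven job release; not GECCO code. */
final class DependencyGraphSketch {
    // Maps each job name to the set of jobs that must finish before it starts.
    private final Map<String, Set<String>> predecessors = new HashMap<>();

    /** Registers a job; it has no predecessors until a dependency is added. */
    void addJob(String job) {
        predecessors.computeIfAbsent(job, k -> new HashSet<>());
    }

    /** Declares that 'job' may start only after 'predecessor' has completed. */
    void addDependency(String predecessor, String job) {
        addJob(predecessor);
        addJob(job);
        predecessors.get(job).add(predecessor);
    }

    /**
     * Groups the jobs into waves: every job in a wave has all of its
     * predecessors in earlier waves, so the jobs of one wave could run together.
     */
    List<List<String>> executionWaves() {
        Set<String> completed = new HashSet<>();
        List<List<String>> waves = new ArrayList<>();
        while (completed.size() < predecessors.size()) {
            Set<String> ready = new TreeSet<>(); // sorted for deterministic output
            for (Map.Entry<String, Set<String>> e : predecessors.entrySet()) {
                if (!completed.contains(e.getKey())
                        && completed.containsAll(e.getValue())) {
                    ready.add(e.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("cyclic dependency among remaining jobs");
            }
            waves.add(new ArrayList<>(ready));
            completed.addAll(ready);
        }
        return waves;
    }
}
```

For example, if hypothetical jobs "fetch" and "stage" both precede "model", and "model" precedes "plot", the waves are [fetch, stage], then [model], then [plot]. A real tool would release each job individually as completion events arrive rather than wave by wave; the grouping here only keeps the sketch short.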

GRC. A second example of a high-level application
component is an interactive Graphical Resource Co-allocator
(GRC) illustrated in Figure 10 [5]. This Java application
allows the user to build a network representing the resources
required for an application and to describe how the resources
should be used. A combination of automatic and manual
techniques is then used to guide resource selection,
eventually generating an RSL specification for the resources in
question. MDS services are used to automatically find
candidate sets of resources that meet the user's constraints. The
user then manually selects one of the resource sets or
requests a further search for candidates. Once the user finds
a suitable set of resources, the GRAM or DUROC client
libraries are used to execute, monitor, and possibly terminate
the application(s) (compare Figure 10).

7. FUTURE APPLICATIONS 

The availability of the Java CoG Kit has several advantages
for developing future Grid-based applications. The assumed
platform independence of Java and its increased popularity
provide the basis of a promising platform in the near future.
Furthermore, since Java is well established on the Windows
operating system, it seems an obvious candidate for
delivering a Globus server-side implementation, hence allowing
jobs to be submitted to any NT machine as long as it is
integrated in the Grid. More straightforward is the
development of a Globus thin client, which consists only of the
necessary security routines and the communication routines
to communicate with a Globus server. All previous releases
of CoG components used a pull model to inquire about the
state of a submitted job. Since we have changed the model
to use listeners, it is now easier to write threaded Grid-based
Java applications based on a push model. Projects that will
benefit from this approach are, for example, Gateway
and WebFlow [28].
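
The difference between the pull model and the listener-based push model can be sketched as follows. The types JobState, JobStateListener, and SketchJob are hypothetical and introduced only for illustration; they are not the CoG Kit's GramJob or listener classes, and the update method merely simulates a status change arriving from a remote resource.

```java
import java.util.ArrayList;
import java.util.List;

/** Job states named in the GRAM description; the enum itself is illustrative. */
enum JobState { PENDING, ACTIVE, FAILED, DONE, SUSPENDED }

/** Hypothetical listener interface, loosely modeled on the listener idea in the text. */
interface JobStateListener {
    void stateChanged(SketchJob job);
}

/** Hypothetical job object; it is not the CoG Kit's GramJob class. */
final class SketchJob {
    private final List<JobStateListener> listeners = new ArrayList<>();
    private JobState state = JobState.PENDING;

    /** Pull model: the caller has to ask, typically in a polling loop. */
    JobState getState() {
        return state;
    }

    /** Push model: the caller registers once and is called back on changes. */
    void addListener(JobStateListener listener) {
        listeners.add(listener);
    }

    /** Simulates a status update arriving from the remote resource. */
    void update(JobState newState) {
        if (newState == state) {
            return; // no change, so no notification
        }
        state = newState;
        for (JobStateListener listener : listeners) {
            listener.stateChanged(this);
        }
    }
}
```

With the push model, application code such as job.addListener(j -> System.out.println(j.getState())) reacts to each change without a dedicated polling thread, which is the convenience the paragraph above attributes to listeners.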

The latest Globus system relies in many cases on the
HTTP protocol; hence it is possible to integrate such a thin
client as part of a Web browser to allow submission through
Web pages. Projects like WebSubmit [19] and HotPage [23]
will profit from this change. Making some components
available as Java Beans and integrating them into common
off-the-shelf Java GUI building tools will provide a Grid
development environment that allows Grid programming with
ease. As a result of the availability of the Java CoG Kit,
recent efforts to standardize the Globus delegation model in
cooperation with the development of the Java CoG Kit will
allow a much easier integration in commodity technology in
the future.

8. SUMMARY

Commodity distributed-computing technologies enable the
rapid construction of sophisticated client-server applications.
Grid technologies provide advanced network services
for large-scale, wide area, multi-institutional environments
and for applications that require the coordinated use of
multiple resources. In the Commodity Grid project, we seek to
bridge these two worlds so as to enable advanced applications
that can benefit from both Grid services and sophisticated
commodity development environments.

The Java Commodity Grid Toolkit (CoG Kit) described in
this paper represents a first attempt at creating such a
bridge. Building on experience gained over the past three
years with the use of Java in Grid environments, we have 



Figure 10: The GRC allows the user to select a compute resource for scheduling a job interactively from a set of
automatically derived machines that fulfill a user-specific constraint.



defined a rich set of classes that provide the Java programmer
with access to basic Grid services, enhanced services
suitable for the definition of desktop problem solving
environments, and a range of GUI elements. Initial experiences
with these components have been positive. It has proved
possible to recast major Grid services in Java terms without
compromising on functionality. Some substantial Java CoG
Kit applications have been developed, and reactions from
users have been positive.

Our future work will involve the integration of more
advanced services into the Java CoG Kit and the creation of
other CoG Kits, with CORBA, DCOM, and Python being
early priorities. We also hope to gain a better understanding
of where changes to commodity or Grid technologies can
facilitate interoperability and of where commodity
technologies can be exploited in Grid environments.

9. AVAILABILITY

The Java CoG Kit is available in alpha release from the CoG
Kit Web pages [27]. The release of the components is done
gradually to assure the necessary quality control of the
delivered packages, classes, and methods. At present, the main
distribution contains the low-level components. Besides the
components described in this paper, we have an implementation
of network-based quality-of-service methods. We expect
that this implementation will be released as soon as the Globus
toolkit API for this area is frozen. For more release notes,
we refer to the Web page http://www.globus.org/cog.
Abstract

Building Problem Solving environments in the emerging
national-scale Computational Grid infrastructure is a
challenging task. Accessing advanced Grid services, such as
authentication, remote access to computers, resource
management, and directory services, is usually not a simple
matter for problem solving environment developers. The
Commodity Grid project is working to overcome this
difficulty by creating what we call Commodity Grid Toolkits
(CoG Kits) that define mappings and interfaces between
the Grid and particular commodity frameworks familiar to
problem solving environment developers. In this paper, we
explain why CoG Kits are important for problem solving
environment developers, describe the design and implementation
of a Java CoG Kit, and use examples to illustrate how
CoG Kits can enable new approaches to application
development based on the integrated use of commodity and Grid
technologies.



1. Introduction

The development of next-generation problem solving
environments (PSEs) [12] is influenced by rapid advances
in the world of commodity computing and the emerging
national-scale Computational Grid. The explosive growth
of the Internet and of distributed computing in general
has led to significant technology improvements in several
domains that are important for the development of
PSEs accessing large-scale computational resources. In the
world of commodity computing, a broad spectrum of
distributed computing technologies (Web protocols, Java, JINI,
CORBA, DCOM, etc.) has emerged, with revolutionary
effects on how we access and process information.
Simultaneously, the high-performance computing community has
taken big steps toward the creation of so-called Grids,
advanced infrastructures designed to enable the coordinated
use of distributed high-end resources for scientific problem
solving.



These two worlds of what we will call "commodity" and
"Grid" computing have evolved in parallel, with different
goals leading to different emphases and technology
solutions. For example, commodity technologies tend to focus
on issues of scalability, component composition, and
desktop presentation, while Grid developers emphasize
end-to-end performance, advanced network services, and support
for unique resources such as supercomputers. The results
of this parallel evolution are multiple technology sets with
some overlaps, much complementarity, and some obvious
gaps.

In this context, we have established the Commodity Grid
(CoG) project, with the twin goals of (a) enabling
developers of PSEs to exploit commodity technologies wherever
possible and (b) exporting Grid technologies to commodity
computing for easy integration in PSEs.

A first activity being undertaken within the CoG project
is the design and development of a set of Commodity Grid
Toolkits (CoG Kits) that define and implement a set of
general components that map Grid functionality into a
commodity environment/framework. Hence, we can imagine a
Web/CGI CoG Kit, a Java CoG Kit, a CORBA CoG Kit,
a DCOM CoG Kit, and so on. In each case, the benefit
of the CoG Kit is that it enables application developers
to exploit advanced Grid services (resource management,
security, resource discovery) while developing higher-level
components in terms of the familiar and powerful application
development frameworks provided by commodity
technologies. In each case, we also face the challenge of
developing appropriate interfaces between Grid and commodity
concepts and technologies and, if similar Grid and
commodity services are provided, reconciling competing
approaches.

As part of these activities, we have successfully
developed a Java-based Commodity Grid Toolkit (Java CoG Kit)
that defines and implements a set of general components
mapping Grid functionality into the Java framework. The
Java CoG Kit is of particular interest for PSE developers
because it allows them to implement preinstalled
heavyweight applications to be started on user-accessible
compute servers, as well as lightweight Web interfaces or portals
allowing access to sophisticated remote compute services.

The primary goal of our research is not to build a PSE 
that will solve a specific problem for a particular applica- 
tion area. Instead, our focus is on developing a software 
infrastructure to make it easier to build and deploy
powerful PSEs. We have based our development of the Java
CoG Kit on our experiences with application users in various
problem domains. Thus, we are confident that the toolkit is 
general enough to be useful for a large number of PSE de- 
velopers. 

While we have introduced in [17] the general concepts of
the Java CoG Kit, we will illustrate in this paper its practical
use in the development of problem solving environments.
Additionally, we introduce here new components and more
sophisticated security concepts that are of particular interest
to developers of chemistry problem-solving environments.

2. Portals to Problem Solving Environments



For readers to understand the scope of this work, we
explain the terms problem solving environment and portal,
since multiple definitions are used for these terms in the
literature.

2.1. Problem Solving Environment

Our understanding of a PSE follows approximately the
definition given in [7]: "A problem solving environment is
a computational system that provides a complete and
convenient set of high level tools for solving problems from a
specific domain. The PSE allows users to define and
modify problems, choose solution strategies, interact with and
manage appropriate hardware and software resources,
visualize and analyze results, and record and coordinate
extended problem solving tasks. A user communicates with
a PSE in the language of the problem, not in the language
of a particular operating system, programming language, or
network protocol."

For our research focus we assume that the problems must
access remote resources, potentially in a secure fashion,
and may require a large amount of compute and/or data
resources. The process of solving the problem is steered by
the scientist, and its progress may be monitored through
Internet browsers or special-purpose application-monitoring
programs.

2.2. Requirements for PSE Portals 

We identified a list of characteristics that influenced our 
PSE toolkit design [1]: 

Problem-oriented. The PSE should allow specialists to 
concentrate on their discipline, without having to become 



experts in computer science issues, such as networks, paral- 
lel computing, or the World Wide Web. 

Integrated. Many problems and their solution strategies
are extremely heterogeneous: in models, codes,
applications, and machines. A PSE must be designed to manage
this heterogeneity in an integrated way, so that the user is
presented with a predictable and consistent PSE.

Collaborative. Most science and engineering projects
are performed in collaborative mode with physically
distributed participants. A PSE must include the ability to
foster collaborative solution strategies. We assume that a
general-purpose video conferencing tool can be provided
with common off-the-shelf tools developed by commercial
companies. Nevertheless, it may be necessary to develop
special-purpose collaborative tools that are not provided by
third parties.

Distributed. Besides the need to support distributed
collaboration between scientists, many problems we have been
dealing with (such as Grand Challenges) can be solved only
while accessing large distributed resources (such as storage
and compute resources) in conjunction with each other.
A PSE must be able to access these distributed resources
seamlessly and in collaboration.

Persistent. Since developing a solution for a problem
may require significant time, it is desirable to provide a
persistent environment that allows the researcher to resume the
solution process at a later time at a potentially different
location. Thus, it is necessary to be able to checkpoint not
only the state of the calculation but also the state of the PSE
user interface. The persistence of a PSE could be enhanced
with preferences that are either set by the user or are
detected automatically by the PSE. Such functionality could
be achieved with the integration of what is called an
electronic notebook.

Open, flexible, adaptive. Problem strategies require the
ability to integrate novel ideas. A sophisticated PSE-building
tool must make it possible to tailor existing functionality
or add new functionality to its existing base.

Graphical, visual. The use of graphics and visuals can
enhance the usability of the PSE, for example, through
animated tables and directed graphs to visualize the state of the
application. Furthermore, it must be possible to integrate
custom-designed graphical and visual inputs and outputs.

2.3. Portal for Problem Solving Environments

A "Web portal" is commonly defined as an entry point
or starting site for the World Wide Web, combining a
mixture of content and services that attempts to provide a
personalized "home base" for its audience. Features include
customizable start pages to guide users easily through the
services provided by the portal. Such services include
filterable e-mail, chat rooms and message boards, personalized
news, gaming channels, shopping capabilities, advanced
search engines, and personal homepage construction
kits. Examples of consumer-oriented portals are provided
by AOL and Yahoo.

Figure 1. A computing portal interfaces
clients with Grid resources such as storage
servers, supercomputers, and workstation
clusters.

In this spirit, we suggest that a convenient way of
interfacing with a PSE is to design portals for a scientific domain
or a particular problem strategy. Besides providing
collaborative, interactive, and information services, such portals
also include services that are unique to the domain but are
typically not provided by consumer-oriented portals. These
services include interfaces that connect users of the PSE,
through clients ranging from graphics workstations to
palm pilots, to the resources available as part of the
computational Grid (Figure 1). Naturally, not all capabilities of a
portal may be exposed by less capable access devices such
as palm pilots. Nevertheless, the ability to send a message to
a beeper, palm pilot, or cell phone adds significant value to
the PSE functionality by notifying the user of the existence
of a collaborative session or the completion of a problem
solution. Hence, the ability to access a portal with various
(even less capable) devices is an integral part of our design.

2.4. Users and Usage Modes of PSE Portals 

Portal development for PSEs first requires determining 
which customer group will be using the portal. We distin- 
guish three target groups: 

1. Novice science or problem solving environment users,
that is, casual or novice users using readily available
solutions to problems. The problem strategy is
nontransparent to novice users.

2. Expert science or problem solving environment users,
that is, users in the domain for which the portal is
developed. Such users are able to extend the portal
while providing solution strategies as used by the
novice users or themselves.

3. Developers of applications or problem solving
environments, providing general-purpose components used by
experts or novice users.

In addition, we distinguish between interactive and batch
mode in which jobs are submitted from the problem solving
environment to the backend systems by the users. We
have to be able to support the use of compute resources
through fine-grained parallel programs, typically provided
through MPI message-passing parallel programs, or
coarse-grain parallel programs through job dependencies between
jobs submitted to the batch processing systems or a fork
jobmanager. The toolkit we describe in this paper supports
these usage modes.

3. Architecture 

Because of the diversified use of a PSE portal, the
architecture of such an environment must be flexible. Thus, it is
not feasible to develop a point solution for a single problem.
Needed instead is a portal toolkit that includes a set of
services exposed via APIs that can be used to assemble a point
solution for a problem. Figure 2 and Table 1 outline the
various groups of services that we initially focus on and that
must be integrated into a portal toolkit. Each portal
component may have several subcomponents that support the tasks
performed as part of the computing portal for problem
solving environments. The components in bold text of Figure
2 are developed as part of the CoG Kit. Other components
are provided either by commodity software or the
application programmers. The flexible design makes it possible
to integrate new components into the framework or replace
existing modules.

3.1. Grid Core Services 

The scientific problem-solving infrastructure of the
twenty-first century will support the coordinated use of
numerous distributed heterogeneous components, including
advanced networks, computers, storage devices, display
devices, and scientific instruments. The term "The Grid" is
often used to refer to this emerging infrastructure [5][6].
NASA's Information Power Grid and the NCSA Alliance's
National Technology Grid are two contemporary projects
prototyping Grid systems; both build on a range of
technologies, including many provided by the Globus project in
which we are involved. In designing PSE portals, we make
extensive use of these technologies, including Globus
services, such as
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Figure 2. A computing portal is built with 
the help of a variety of portal components
ranging from specialized application-specific
portal components to components for
using distributed compute resources or other
Grid infrastructure.



Table 1. Portal Components

Portal Component   Sub Component    Function
-----------------  ---------------  ----------------------------------------
Collaboration                       Video collaboration (e.g. [illegible])
                   c-ticker         news [illegible]
Design                              Java IDE (e.g. VisualAge, Forte, ...)
                                    Java JDB
Science                             Application specific, provided by
                                    scientists
Compute Resource   p-trader         locates compute resources
                   p-broker         schedules jobs
                   p-flow           dependencies between jobs
                   p-debug          debugs job execution
                   gram             Globus job submission
Security           p-crypt          sends secure messages
                   p-authenticate   authenticates to the system
                                    Grid Security Infrastructure
Administration     p-installer      installs software on client
Monitoring         p-monitor        monitors the state
                   mds              Globus Metacomputing Directory Service
Display            [illegible]      displays information from XML


• the information service (MDS), which enables uniform
access to information about the structure and state of
Grid resources;

• an authentication and authorization service (GSI),
which provides mechanisms for establishing, identifying,
and creating delegatable credentials; and

• a uniform job submission service across distributed
scheduling systems (GRAM).

These Grid services are often termed "middleware": they
typically involve a distributed state and can be viewed as
a natural evolution of the services provided by today's
Internet. They build the basis for developing a Grid-based
problem solving environment because many of the portal
components use their services.

3.2. Job Submission and Execution

One of the main services a PSE portal must provide is
job submission to remote resources. This must be done
in seamless fashion from the desktop with a single sign-on
authentication. Computers must be located and the
computation must be started on the selected systems. It is essential
to monitor the progress of the job execution and obtain the
results of the calculation through, for example, output files
that may be manipulated locally on the client side (the
computer from which the job was initiated). We are able to
support such uniform job submission while using the Globus
metacomputing toolkit to access Grid resources securely.

Authentication. The first step of the job submission is to
authenticate with the system. Authentication is the process
of verifying the identity of an entity. Although the
cryptographic algorithms that form the basis of most security
systems, such as public key cryptography, are relatively
simple, it is a challenging task to use these algorithms to
meet diverse security goals in complex, dynamic problem
solving environments, with potentially large and dynamic
sets of users and resources and fluid relationships between
users and resources. Authentication solutions for problem
solving environments in a Computational Grid must solve
two problems not commonly addressed by standard
authentication technologies.

The first problem is support for local heterogeneity. The
resources available in the Grid are operated by a diverse
range of entities, each defining a different administrative
domain.

The second problem is support for N-way security contexts.
In traditional client-server applications, authentication
involves just a single client and a single server. In
contrast, a Grid-based PSE may require and dynamically 



maintained resources. Thus, it must be possible to establish
a security relationship between any two processes in
the computation used to solve the problem even if they are
in different administrative domains. To simplify our task
we use the Grid Security Infrastructure (GSI), which deals with
the authentication. GSI policy allows a user to authenticate
just once per computation, at which time a credential
is generated that allows processes created on behalf of the
user to acquire resources, and so on, without additional user
intervention. Local heterogeneity is handled by mapping
a user's Grid identity into local user identities at each
resource. In summary, the GSI security model provides PSEs
the following advantages: single sign-on for all resources,
no need for the user to keep track of accounts and passwords at
multiple sites, and no plaintext passwords.

Protocol-based Job Submission. Recently, Globus has
been enhanced to include an HTTP-based protocol for job
submission. Thus, job submission can be initiated from a
client on which no other Globus components are installed.
Figure 3 shows the Globus components that are involved in
such a job submission. First, one has to authenticate with
the system, which is done with the help of public key
infrastructure and a proxy delegation while generating a
temporary key. Jobs are submitted from the client side through
API calls known as gram-submit and gram-request. The
gatekeeper on the Globus-enabled resource verifies whether
the user is allowed to submit a job to it and checks the
availability of the user's public key in a grid map file local to the
resource. Once a job has been successfully submitted to the
system, it is started with the help of the job manager, and
its state is monitored with the help of the reporter. During
startup of a job a user can register callback handlers that
provide job status updates. In our Java CoG Kit we have
implemented all components and services responsible for the
proxy initialization and the job submission. Furthermore,
we have replaced the C-based callback service with a
Java-based event service. Thus, all components to submit a job
are available in pure Java, allowing even Windows clients
to submit jobs to Globus servers.

3.3. Additional Security Issues

In the preceding sections we addressed security issues
related to authentication and authorization while using the
security policy suggested by Globus. The authorization to
use a particular Grid resource can be controlled via a
grid-map file and appropriately specified group permissions
controlled by the local system administrators.
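
The grid-map lookup mentioned above can be illustrated with a short sketch. It assumes the common grid-map layout of one entry per line, with a quoted certificate subject name followed by a local account name. The class name GridMapSketch and the sample subject names are hypothetical; this is not code from Globus or the CoG Kit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative grid-map lookup; a sketch, not part of Globus or the CoG Kit. */
final class GridMapSketch {
    private final Map<String, String> dnToLocalUser = new LinkedHashMap<>();

    /** Parses lines of the assumed form: "distinguished name" localuser */
    GridMapSketch(String gridMapText) {
        for (String raw : gridMapText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comments
            if (!line.startsWith("\"")) continue;                 // this sketch only accepts quoted names
            int close = line.indexOf('"', 1);
            if (close < 0) continue;                              // malformed entry: ignore it
            String dn = line.substring(1, close);
            String localUser = line.substring(close + 1).trim();
            if (!localUser.isEmpty()) dnToLocalUser.put(dn, localUser);
        }
    }

    /** Returns the local account for a Grid identity, or empty if no entry exists. */
    Optional<String> localUserFor(String gridIdentity) {
        return Optional.ofNullable(dnToLocalUser.get(gridIdentity));
    }
}
```

An empty result corresponds to a user who has no entry on this resource; a gatekeeper-like component would refuse the request in that case.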

Nevertheless, we still have to address issues such as the
secure communication between programs. To guarantee
privacy, we use the security mechanisms provided by secure
socket connections, which we can obtain through Globus IO.




Figure 3. The components of the Globus security
infrastructure used during job submission.
All client side components are available
within the CoG Kit as pure Java components.



This allows us to send messages and data in a secure fashion 
between compute resources. 

4. Java CoG Kit 

In the remainder of this paper we focus our attention on
our Java CoG Kit prototype, which enables us to build the
components listed in Table 1 and used as part of a PSE.
Because of the large number of packages and classes required
to expose the necessary functionality of the Globus toolkit,
we focus in this paper on a subset of the classes that we
deem most useful for the development of PSE-based Grid
applications. The design of the Java CoG Kit is intended to
facilitate the development of future components as a
community project. To support an iterative process of definition,
development, and application of a Java CoG Kit in
collaboration with other teams, we classify components in four
layers. This categorization provides the necessary
subdivisions to coordinate such a challenging open community
software engineering task.

Low-Level Grid Interface Components provide mappings
to commonly used Grid services: for example,
the Grid information service (the Globus Metacomputing
Directory Service, MDS), which provides
Lightweight Directory Access Protocol (LDAP) [9]
access to information about the structure and state of
Grid resources and services; resource management
services, which support the allocation and management
of computational and other resources (via the
Globus GRAM and DUROC services); and data
access services, for example, via the Globus GASS
service [3].

Low-Level Utility Components are utility functions
designed to be reused by many users. Examples are
components that use information service functions to find
all compute resources that a user can submit to, that
prepare and validate a job specification while using
the Extensible Markup Language (XML) or the Globus
job submission language (RSL), that locate the
geographical coordinates of a compute resource, and that
test whether a machine is alive.

Figure 4. This sample script demonstrates
how we access basic Grid services with the
help of the Java CoG Kit. Here data for a
structural biology code called SnB are located,
an appropriate machine is selected,
and the calculation is executed on that machine.



Low-Level GUI Components provide basic graphical
components that can be reused by application developers.
Examples are LDAP attribute editors, RSL editors,
LDAP browsers, and search components.



Application-specific GUI Components simplify the
bridge between applications and the basic CoG Kit
components. Examples are a stock market monitor,
a graphical climate data display component, or a
specialized search engine for climate data.



Figure 4 shows how a small set of services provided by
the Java CoG Kit may be used in practice. This Java
program skeleton demonstrates how simple it is to build
portal-specific services when accessing a variety of basic Grid
services through the Java CoG Kit. In this example, an
appropriate machine is selected for execution, data for an
instantiation of a problem specific algorithm is determined, and the
job is executed on that machine, resulting in the generation
of an output file.

4.1. Low-Level Grid Interface Components

We describe here a subset of packages that provide the 
interface to the low-level Grid services and application in- 
terfaces. These packages are used by many users to develop 
Java-based programs in the Grid. We describe only the
general functionality of these packages. A complete list of the
classes and methods accompanies the distribution [18].

RSL. The package org.globus.rsl provides methods for
creating, manipulating, and checking the validity of the
Resource Specification Language (RSL) expressions used in
Globus [8] to express resource requirements. As shown in
Step 3 of Figure 4, the arguments to a new call include
parameters that specify both characteristics of the required
resources and properties of the computation.
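
To make the shape of such an expression concrete, the following sketch assembles a simple RSL conjunction of attribute=value relations. RslSketch and its methods are hypothetical helpers written for this illustration; they are not the org.globus.rsl API, and the quoting rule is deliberately simplified (values with whitespace are quoted, values with embedded quotes are rejected).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that renders a simple RSL conjunction; not the CoG Kit API. */
final class RslSketch {
    private final Map<String, String> relations = new LinkedHashMap<>();

    /** Adds one attribute=value relation, keeping insertion order. */
    RslSketch with(String attribute, String value) {
        relations.put(attribute, value);
        return this;
    }

    /** Renders a conjunction such as &(executable=/bin/date)(count=1). */
    String render() {
        StringBuilder sb = new StringBuilder("&");
        for (Map.Entry<String, String> e : relations.entrySet()) {
            sb.append('(').append(e.getKey()).append('=')
              .append(quote(e.getValue())).append(')');
        }
        return sb.toString();
    }

    // Simplification: quote only when whitespace is present, refuse embedded quotes.
    private static String quote(String value) {
        if (value.indexOf('"') >= 0) {
            throw new IllegalArgumentException("embedded quotes are not handled in this sketch");
        }
        boolean hasSpace = value.chars().anyMatch(Character::isWhitespace);
        return hasSpace ? "\"" + value + "\"" : value;
    }
}
```

A call such as new RslSketch().with("executable", "/bin/date").with("count", "1").render() produces a string that names both the program to run and a property of the requested resources.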

GRAM. The package org.globus.gram provides a mapping
to the Globus Resource Allocation Manager (GRAM)
services [8], which allow users to schedule and manage
remote computations. The classes and methods distributed
allow users to submit jobs, bind to already submitted jobs,
and cancel jobs on remote computers. Other methods allow
users to determine whether they can submit jobs to a
specific resource (through a Globus gatekeeper) and to monitor
the job status (pending, active, failed, done, and suspended).
As shown in Step 4 of Figure 4, the class Gram is used
to create a job with an RSL string describing the job and a
machine contact that determines on which machine the job
is requested for execution. Our Java mapping differs from
that provided in Globus for C through the introduction of a
formal job object, as well as the availability of a
sophisticated event model in Java. Our implementation utilizes this
event model and transfers the C callbacks into equivalent
Java events. In Java one can now use threads in order to
"listen" to a particular event that can trigger further actions.
A Java interface GramJobListener that contains the
method stateChanged(GramJob job) can be used
to define customized job listeners that can be added with the
GramJob method addListener(GramJobListener
listener).
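
The listener mechanism described above follows the usual Java observer pattern. The sketch below shows that pattern with stand-in types: SketchJob, SketchJobListener, and JobStatus are hypothetical names defined here for illustration and only mirror the method names stateChanged and addListener mentioned in the text. They are not the classes shipped in org.globus.gram.

```java
import java.util.ArrayList;
import java.util.List;

/** Job states named in the text; the enum itself is an assumption of this sketch. */
enum JobStatus { PENDING, ACTIVE, FAILED, DONE, SUSPENDED }

/** Stand-in for a job listener interface with a single callback. */
interface SketchJobListener {
    void stateChanged(SketchJob job);
}

/** Stand-in for a job object that notifies listeners when its status changes. */
final class SketchJob {
    private final String rsl;
    private final List<SketchJobListener> listeners = new ArrayList<>();
    private JobStatus status = JobStatus.PENDING;

    SketchJob(String rsl) { this.rsl = rsl; }

    String getRsl() { return rsl; }
    JobStatus getStatus() { return status; }

    void addListener(SketchJobListener listener) { listeners.add(listener); }

    /** In a real system the new status would be reported by the remote job manager. */
    void setStatus(JobStatus newStatus) {
        if (newStatus == status) return;          // no event when nothing changed
        status = newStatus;
        for (SketchJobListener l : new ArrayList<>(listeners)) {
            l.stateChanged(this);                 // deliver the event to each listener
        }
    }
}
```

A listener can then be registered with a lambda, for example job.addListener(j -> System.out.println(j.getStatus())), so that follow-up actions are triggered by status events rather than by polling.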



MDS. The package org.globus.mds simplifies access to
the Metacomputing Directory Service (MDS) [15], which
is an important part of the Globus information service. Its
functions include (a) establishing a connection to an MDS
server, (b) querying MDS contents, (c) printing, and (d)
disconnecting from the MDS server. The package provides
an intermediate application layer that can be easily adapted
to different LDAP [9] client libraries, including JNDI [10],
Netscape SDK [11], and Microsoft SDK [13].

As shown in Step 1 of Figure 4, the parameters to
initialize the MDS class are the DNS name of the MDS server,
the port number for the connection, and the distinguished
name (DN) that specifies the root for a search in the
directory tree. A search is performed in Step 2a; the first
parameter specifies the top level of the tree in which the search is
performed, the second parameter specifies the LDAP query,
and the third parameter specifies the scope, that is, for how
many levels in the tree the search should continue (in our
case, only the next level). Search results can also be stored
in a NamingEnumeration provided by JNDI.
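
The three connection parameters and the search scope described above map directly onto standard JNDI settings. The following sketch uses only the JDK's javax.naming classes and does not open a connection; the class name MdsSearchSketch, the host name, and the base DN in the usage note are assumptions for illustration and are not taken from org.globus.mds.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.SearchControls;

/** Sketch of JNDI settings for an LDAP directory search; not the CoG Kit MDS class. */
final class MdsSearchSketch {
    /** Builds the JNDI environment from server name, port, and root DN. */
    static Hashtable<String, Object> environment(String host, int port, String baseDn) {
        Hashtable<String, Object> env = new Hashtable<>();
        // The JDK's built-in LDAP provider
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        // Note: a DN containing spaces or reserved characters must be URL-encoded first
        env.put(Context.PROVIDER_URL, "ldap://" + host + ":" + port + "/" + baseDn);
        return env;
    }

    /** Restricts a search to the entries one level below the search root. */
    static SearchControls oneLevel() {
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.ONELEVEL_SCOPE);
        return controls;
    }
}
```

With a reachable server, passing the environment to new InitialDirContext(env) and calling search("", "(objectclass=*)", MdsSearchSketch.oneLevel()) would return a NamingEnumeration of results, matching the one-level scope described in the text.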

GASS. The Global Access to Secondary Storage (GASS)
service [3] simplifies the porting and running of
applications that use file I/O, eliminating the need to manually log
onto sites and ftp files or to install a distributed file system.
The package org.globus.gass provides an essential subset
of GASS services to support the copying of files between
computers on which the Grid services are installed. The
method get(String from, String to) copies a remote file to a
local file, and the method put(String from, String to) copies
a local file to a remote location. The fetch method used in
our example (Figure 4) provides a convenient wrapper and
uses internally the previously mentioned get method.

4.2. Low-Level Utilities

The low-level utility classes currently defined in the CoG
Kit provide an abstract datatype representing acyclic graphs
and basic XML parsing routines. The graph class is used,
for example, to access dependencies between jobs, a major
requirement for PSEs. The XML classes are used to
provide transformations between different data formats. Using
XML has the advantage that a Document Type Definition
(DTD) that is defined for these data formats can be used to
verify whether a record to be transmitted is well formed
before it is sent to a server. Thus the load on servers can be
dramatically reduced. The availability of a dependency
between jobs is a significant extension to the existing Globus
low-level application interface. In addition, we have defined
a general concept of a machine and job broker interface.
This enables a programmer to define a customized selection
of machines and jobs dependent on their needs. We have
used this technology as part of a high-throughput broker
that is implemented in Java but can also be exposed through
CORBA objects. The GECCO application introduced in
Section 4.4 uses the Java-based machine and job brokers.

Figure 5. A broker interface allows us to
specify an easy way to develop compatible
components relying on this interface. Jobs
and machines are selected based on a predefined
access/security policy as well as a
scheduling policy. The policies may be generated
dynamically based on other system information.

The broker is a good example of a universally useful
component for PSE developers, as well as Grid users. Here
a set of jobs and machines is stored in two tables.
Dependent on a scheduling and access policy, a machine is
selected and a job is scheduled for execution on this
machine (see Figure 5). We have defined a simple interface
outlined in Figure 6. This interface allows us to add jobs and
machines to the sets so that it is possible to administer them
dynamically. With the help of this interface we have defined
multiple scheduling policies such as first-come-first-served
and load balancing based on resource characteristics.
Currently we are investigating the use of economy models for
scheduling jobs to machines.

4.3. Low-Level GUI Components

The Java CoG Kit low-level GUI components provide
basic graphical components that can be used to build more
advanced GUI-based applications. These components
include text panels that format RSL strings, tables that display
results of MDS search queries [17], trees that display the
directory information tree of the MDS, and tables to display
HBM and network performance data. Each component can
be customized and is available as a JavaBean. In future
releases of the Java CoG Kit it will be possible to integrate the
beans in a Java-based GUI composition tool such as JBuilder
or VisualCafe.
interface broker ... {
    addJob(JobDescription job)
    deleteJob(JobDescription job)
    addMachine(MachineDescription machine)
    deleteMachine(MachineDescription machine)
    setAccessPolicy(BrokerAccessPolicy policy)
    setSchedulingPolicy(BrokerSchedulingPolicy policy)

    MachineDescription getMachine()
    JobDescription getJob()
}

Figure 6. This code fragment shows the elementary
methods of the broker. Jobs and machines
can be added. The job and machine
returned by the get methods are defined by
the policies and the algorithms defined by an
object instantiation of the interface.
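
As an illustration of how a scheduling policy might sit behind such an interface, the sketch below pairs jobs with machines in first-come-first-served order. It is a simplification under stated assumptions: jobs and machines are plain strings rather than description objects, there is no access policy, and the class name FcfsBrokerSketch and the combined next() method are hypothetical, not the broker shipped with the CoG Kit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal first-come-first-served broker sketch; names and types are illustrative. */
final class FcfsBrokerSketch {
    private final Deque<String> jobs = new ArrayDeque<>();
    private final Deque<String> machines = new ArrayDeque<>();

    void addJob(String job) { jobs.addLast(job); }
    void deleteJob(String job) { jobs.remove(job); }
    void addMachine(String machine) { machines.addLast(machine); }
    void deleteMachine(String machine) { machines.remove(machine); }

    /**
     * Pairs the oldest waiting job with the machine at the head of the queue.
     * Returns {job, machine}, or null when no job or no machine is available.
     */
    String[] next() {
        if (jobs.isEmpty() || machines.isEmpty()) return null;
        String job = jobs.pollFirst();
        String machine = machines.pollFirst();
        machines.addLast(machine);   // rotate so that machines are used in turn
        return new String[] { job, machine };
    }
}
```

Because jobs and machines can be added or deleted at any time, the sets can be administered dynamically, as the text describes; a load-balancing policy would replace the simple rotation with a choice based on resource characteristics.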



4.4. PSE Application Level Utilities and GUI Com- 
ponents 

High-level graphical applications combine a variety of
CoG Kit components to deliver a single application or
applet. These applications can be combined to provide even
greater functionality. The user should select the tools that
seem appropriate for the task. To demonstrate the range of
applications, we have included a set of screen dumps that
highlight the look and feel of some applications developed
to date.

GECCO. The Graph Enabled Console Component
(GECCO) is a graphical tool for specifying and monitoring
the execution of sets of tasks with dependencies between
them [16][14]. Specifically, it allows one to

1. specify the jobs and their dependencies graphically or
with the help of an XML-based configuration file;

2. debug the specification in order to find erroneous
specification strings before the job is submitted; and

3. execute and monitor the job graphically and with the
help of a log file.

As shown in Figure 7, each job is represented as a node in
the graph. A job is executed as soon as its predecessors are
reported to have successfully completed. The state of a job
is animated with colors. It is possible to modify the
specification of the job by clicking on the node: a specification
window pops up allowing the user to edit the RSL, the
label, and other parameters. Editing can also be performed
during runtime (job execution), hence providing for simple
computational steering.
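
The execution rule stated above, that a job starts once all of its predecessors have completed successfully, can be sketched as follows. DependencyGraphSketch and its method names are hypothetical and written for this illustration; the sketch does no cycle detection, so it assumes the caller supplies an acyclic graph.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of dependency-driven job release; assumes the job graph is acyclic. */
final class DependencyGraphSketch {
    private final Map<String, Set<String>> predecessors = new LinkedHashMap<>();
    private final Set<String> completed = new HashSet<>();
    private final Set<String> started = new HashSet<>();

    /** Registers a job and the jobs it depends on. */
    void addJob(String job, String... dependsOn) {
        predecessors.computeIfAbsent(job, k -> new HashSet<>());
        for (String p : dependsOn) {
            predecessors.computeIfAbsent(p, k -> new HashSet<>());
            predecessors.get(job).add(p);
        }
    }

    /** Records that a job finished successfully. */
    void markCompleted(String job) { completed.add(job); }

    /** Returns jobs not yet started whose predecessors have all completed, and marks them started. */
    List<String> startReadyJobs() {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : predecessors.entrySet()) {
            if (!started.contains(e.getKey()) && completed.containsAll(e.getValue())) {
                ready.add(e.getKey());
            }
        }
        started.addAll(ready);
        return ready;
    }
}
```

Calling startReadyJobs() after each completion report releases exactly those nodes whose incoming edges are all satisfied, which is the behavior a graphical monitor would animate.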




Figure 7. The Grid Enabled Console component
(GECCO) allows the user to specify
dependencies between tasks that are to be
executed in the Grid environment.



High-Throughput Broker. We have developed a
prototype of a high-throughput broker to test whether the
interfaces and classes allow one to easily generate high-level
components that simplify job maintenance tasks for certain
problem-solving strategies. One of the tasks that has been
identified and is common to many solution strategies is to
perform a parameter study [2][4]. That is, an algorithm is
repeatedly executed with a variety of parameters. Our
system is based on the interface of a broker and thus allows us
to clearly separate the GUI presentation from the
functionality (Figure 8). The prototype looks for compute resources
available in a pool of machines formed by a Grid
information service with the help of the Globus MDS. From this
pool we select those resources that are idle and are
available for calculation. If a resource is not able to fulfill a job
(because of connection timeout or excessive time needed
to complete the job), the resource is automatically removed
from the set of viable candidates. The set of resources as
well as those removed from the list can be manipulated
through an interactive shell. A similar interface exists for
the jobs. Special attention has to be placed on the
implementation of such a broker. Although it is possible to spawn
for each job and machine a thread that maintains the
appropriate object, we have chosen to maintain the jobs and
machines in lists to avoid the overhead associated with threads
and the expected resource limitations on the machine on
which the system is running. Thus, we are able to handle
submissions that maintain 10,000 or more jobs, a task that
would otherwise be impossible.

5. Installation and Upgrading 

An important function that must be provided by a PSE
is to install and upgrade the software that accesses the
various services exposed as part of its design. Using Java will
provide us with several options for deploying our client
software. In addition to traditional methods of delivering client
software to be installed and configured prior to its use, we
can develop thin-client software, which can be dynamically
installed or updated as well as loaded at time of use.

Figure 8. A high-throughput broker allows the
submission of many jobs as part of a problem.
After all jobs are completed a solution of
the problem can be obtained. The progress
of the calculation is monitored with a GUI.

Preinstallation of the software in the form of a
standalone application or a library is convenient for applications
that would take too long to be installed via a network
connection (Figure 9). This strategy is today used by many
commercial portals as part of their access software enabled
with the help of so-called browser plug-ins. Nevertheless,
we recognize the fact that it is sometimes not possible to
install any software on the client computer because the user
does not have sufficient access to it. This requires, at the
cost of additional download time, downloading the
appropriate jar files from a well-defined URL. In both cases it
will be possible to augment the jar files with
authentication measures in the form of certificates. These will allow
clients to identify the source of the code upon downloading
our software and to verify that it can be trusted for use on
their systems.

6. Summary 

Commodity distributed-computing technologies enable
the rapid construction of sophisticated client-server
applications. Grid technologies provide advanced network
services for large-scale, wide area, multi-institutional
environments and for applications that require the coordinated use
of multiple resources. In the Commodity Grid project, we
seek to bridge these two worlds so as to enable advanced
applications that can benefit from both Grid services and
sophisticated commodity development environments.

Figure 9. The installation of the CoG Kit onto
a client can be done prior to the start of the
application, as a standalone application or
as a library, or during on-demand execution.

The Java Commodity Grid Toolkit (CoG Kit) described
in this paper represents a first attempt at creating such a
bridge. Building on experience gained over the past three
years with the use of Java in Grid environments, we have
defined a set of classes that provide the Java programmer
with access to basic Grid services, enhanced services
suitable for the definition of desktop problem solving
environments, and a range of GUI elements. Initial experiences
with these components have been positive. It has proven
possible to recast major Grid services in Java terms without
compromising on functionality. Some substantial Java CoG
Kit applications have been developed, and reactions from
users have been positive.

Our future work will involve the integration of more ad- 
vanced services into the Java CoG Kit and the creation of 
other CoG Kits, with CORBA, DCOM, and Python being 
early priorities. We also hope to gain a better understanding
of where changes to commodity or Grid technologies
can facilitate interoperability and of where commodity tech- 
nologies can be exploited in Grid environments. 

With the help of the CoG Kits we have prototyped a
portal to a structural biology problem solving environment.
Other projects are currently investigating the use of the CoG
Kit to simplify the access to Grid resources. Such projects
include the astrophysics portal Cactus, the NCSA
Userportal, and SDSC Hotpage. The requirements demanded by
such projects have influenced our present design, and we are
collaborating with project developers to enhance the
components we provide in the CoG Kit. Most recently, we have
started to address the integration of components developed
by other collaborators.
Abstract. Previous research efforts for building thread migration systems have
concentrated on the development of frameworks dealing with a small local
environment controlled by a single user. Computational Grids provide the
opportunity to utilize a large-scale environment controlled over different organizational
boundaries. Using this class of large-scale computational resources as part of a
thread migration system provides a significant challenge previously not addressed
by this community. In this paper we present a framework that integrates Grid
services to enhance the functionality of a thread migration system. To accommodate
future Grid services, the design of the framework is both flexible and extensible.
Currently, our thread migration system contains Grid services for authentication,
registration, lookup, and automatic software installation. In the context of
distributed applications executed on a Grid-based infrastructure, the asynchronous
migration of an execution context can help solve problems such as remote
execution, load balancing, and the development of mobile agents. Our prototype is
based on the migration of Java threads, allowing asynchronous and heterogeneous
migration of the execution context of the running code.



1 Introduction 

Emerging national-scale Computational Grid infrastructures are deploying advanced
services beyond those taken for granted in today's Internet, for example,
authentication, remote access to computers, resource management, and directory services. The
availability of these services represents both an opportunity and a challenge: an
opportunity because they enable access to remote resources in new ways, a challenge because
the developer of thread migration systems may need to address implementation issues
or even modify existing system designs. The scientific problem-solving infrastructure
of the twenty-first century will support the coordinated use of numerous distributed
heterogeneous components, including advanced networks, computers, storage devices,
display devices, and scientific instruments. The term The Grid is often used to refer
to this emerging infrastructure [5]. NASA's Information Power Grid and the NCSA
Alliance's National Technology Grid are two contemporary projects prototyping Grid
systems; both build on a range of technologies, including many provided by the Globus
project. Globus is a metacomputing toolkit that provides basic services for security, job
submission, information, and communication.



The availability of a national Grid provides the ability to exploit [illegible]
hosts in a network (or Grid), in order to [illegible] essential part for developing
mobile agent systems [illegible] the state of the running program before it is
transported to the new host. [illegible] Mobile-agent systems differ from [illegible]
systems in that the agents move when they choose [illegible] application as well as
[illegible]. In the first part we introduce the thread migration [illegible]
environment. In the third part [illegible] summary of lessons learned and a look at
future activities.

2 The Thread Migration System MOBA

jobs [4][8][13]. The advantages of MOBA are threefold:






Fig. 1. The MOBA system components include MOBA places and a MOBA central server. Each
component has a set of subcomponents that allow thread migration between MOBA places.



3. Support for the execution of native code as part of the migrating thread. While
considering a thread migration system for Grid-based environments, it is
advantageous to enable the execution of native code as part of the overall strategy to support
a large and expensive code base, such as in scientific programming environments.
MOBA will, in the near future, provide this capability. For more information on
this subject we refer the interested reader to [17].



2.1 MOBA System Components

MOBA is based on a set of components that are illustrated in Figure 1. Next, we explain
the functionality of the various components:

Place: Threads are created and executed in the MOBA place component. Here they
receive external messages to move or decide on their own to move to a different place
component. A MOBA place accesses a set of MOBA system components, such as
manager, shared-memory, registry, and security. Each component has a unique functionality
within the MOBA framework.

Manager. A single point of control is used to provide the control of startup and shut- 
down of the various component processes. The manager allows the user to get and set 
the environment for the respective processes. 
Shared Memory: This component shares the data between threads. 
Registry: The registry maintains necessary information (both static and dynamic)
about all the MOBA components and the system resources. This information includes
the OS name and version, installed software, machine attributes, and the load on the
machines.

Security: The security component provides network-transparent programming
interfaces for access control to all the MOBA components.

Scheduler: A MOBA place has access to user-defined components that handle the ex- 
ecution and scheduling of threads. The scheduling strategy can be provided through a 
custom policy developed by the user. 
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the thread object are marshaled. Objects that are bound to file descriptors or other lo- 
cal resources are excluded from a migration. In the final step, the execution context is
serialized. Since a context consists of the contents of stack frames generated by a chain of
method invocations, the externalizer follows the chain from older frames to newer ones
and serializes the contents of the frames. A frame is located on the stack in a JVM and
contains the state of a calling method. The state consists of a program counter, operands
to the method, local variables, and elements on the stack, each of which is serialized in
machine-independent form.
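
The frame-by-frame externalization described above can be illustrated with a short Java sketch. It is not the MOBA implementation: the Frame class, its fields (simplified to int values), and the ContextExternalizer class are hypothetical names chosen for illustration. DataOutputStream is used only because it always writes big-endian values, which gives a byte layout that does not depend on the host machine.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical, simplified model of one stack frame: a program counter,
// operands, local variables, and operand-stack elements (all ints here).
final class Frame {
    final int programCounter;
    final int[] operands;
    final int[] locals;
    final int[] stack;

    Frame(int programCounter, int[] operands, int[] locals, int[] stack) {
        this.programCounter = programCounter;
        this.operands = operands;
        this.locals = locals;
        this.stack = stack;
    }
}

final class ContextExternalizer {
    // Serializes the frames in the order given (oldest frame first).
    // DataOutputStream writes big-endian, independent of the host CPU.
    static byte[] externalize(List<Frame> oldestToNewest) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(oldestToNewest.size());
            for (Frame f : oldestToNewest) {
                out.writeInt(f.programCounter);
                writeInts(out, f.operands);
                writeInts(out, f.locals);
                writeInts(out, f.stack);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Writes a length prefix followed by the values themselves.
    private static void writeInts(DataOutputStream out, int[] values) throws IOException {
        out.writeInt(values.length);
        for (int v : values) {
            out.writeInt(v);
        }
    }
}
```

A real externalizer would also have to record the type of each slot and marshal referenced objects; the sketch shows only the traversal order and the machine-independent encoding.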

Together the facilities for externalizing threads and performing thread migration
enabled us to design the components necessary for the MOBA system and to enhance
the JIT compiler in order to allow asynchronous migration.
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2^ Design Issues of Thread Migration in JVMs 

In designing our thread migration system, we faced several challenges. Here we focus 
on five. 

Nonpreemptive Scheduling. In order to enable the migration of the execution context, 
the migratory thread must be suspended at a migration safe point. Such migration safe
points are defined within the execution of the JVM whenever it is in a consistent state. 
Furthermore, asynchronous migration within the MOBA system requires nonpreemp- 
tive scheduling of Java threads to prevent threads from being suspended at a not-safe 
point. Depending on the underlying (preemptive or nonpreemptive) thread scheduling 
system used in the JVM, MOBA supports either asynchronous or cooperative migration 
(that is, the migratory thread determines itself the destination). The availability of green 
threads will allow us to provide asynchronous migration. 

we developed a version of MOBA that infers the type from the value. Nevertheless, we
recently determined that this capability is not sufficient to obtain a perfect inference and
validation method. Thus, we are developing a modified JIT compiler that will provide
stack frame maps [2] as part of Sun's ResearchVM.

3 Moba/G Service Requirements 

The thread migration system MOBA introduced in the preceding sections is used as a
basis for a Grid-enhanced version which we will call MOBA/G. Before we describe 
the MOBA/G system in more detail, we describe a simple Grid-enhanced scenario to 
outline our intentions for a Grid-based MOBA framework. First, we have to determine 
a subset of compute resources on which our MOBA system can be executed. To do 
so, we query the Globus Metacomputing Directory Service (MDS) while looking for
compute resources on which Globus and the appropriate Java VM versions are installed 
and on which we have an account. Once we have identified a subset of all the machines 
returned by this query for the execution of the MOBA system, we transfer the neces- 
sary code base to the machine (if it is not already installed there). Then we start the 
MOBA places and register each MOBA place within the MDS. The communication 
between the MOBA places is performed in a secure fashion so that only the application 
user can decrypt the messages exchanged between them. A load-balancing algorithm is
plugged into the running MOBA system that allows us to execute our thread-based
program rapidly in the dynamically maintained MOBA places. During the execution of our
program we detect that a MOBA place is not responding. Since we have designed our
program with checkpointing, we are able to start new MOBA places on underutilized
resources and to restart the failed threads on them. Our MOBA application finishes and
deregisters from the Grid environment.

To derive such a version, we have tried to ask ourselves several questions: 

1. What existent Grid services can be used by MOBA to enhance its functionality?

2. What new Grid services are needed to provide a Grid-based MOBA system?

3. Are any technological or implementation issues preventing the integration?

To answer the first two questions, we identified that the following services will be
needed to enhance the functionality of MOBA in a Grid-based environment:

Resource Location and Monitoring Services. A resource location service is used to
determine possible compute nodes on which a MOBA place can be executed. A
monitoring service is used to observe the state and status of the Grid environment
to help in scheduling the threads in the Grid environment. A combination of Globus
services can be used to implement them.

Authentication and Authorization Service. The existent security component in MOBA
is based on a simple centralized maintenance based on user accounts and user
groups known in a typical UNIX system. This security component is not strong
enough to support the increased security requirements in a Grid-based environment.
The Globus project, however, provides a sophisticated security infrastructure that

3.1 Grid-based Registration Service



One of the problems a Grid-based application faces is to identify the resources on which
the application is executed. The Metacomputing Directory Service enables Grid
application developers and users to register their services with the MDS. The Grid-based
information service could be used in several ways:

1. The existing MOBA central registry could register its existence within the MDS.
Thus all MOBA services would still interact with the original MOBA service. The
advantage of including the MOBA registry within the MDS is that multiple MOBA places
could be started with multiple MOBA registries, and each of the places could easily
locate the necessary information from the MDS in order to set up the communication
with the appropriate MOBA registry.

2. The information that is usually contained within the MOBA registry could be
stored as LDAP objects within the distributed MDS. Thus, the functionality of the
original MOBA registry could be replaced with a distributed registry based on the MDS
functionality.

3. The strategies introduced in (1) and (2) could be mixed while registering multiple 
enhanced MOBA registries. These enhanced registries would allow the exchange of 
information between each other and thus function in a distributed fashion. 

Which of the methods introduced above is used depends on the application.
Applications with high throughput demand but few MOBA places are sufficiently supported
by the original MOBA registry. Applications that have a large number of MOBA places 
but do not have high demands on the throughput benefit from a total distributed reg- 
istry in the MDS. Applications that fall between these classes benefit from a modified 
MOBA distributed registry. 
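
The selection rule in the preceding paragraph can be written down as a small Java sketch. The enum and class names are hypothetical, and the two thresholds are placeholders chosen only for illustration, since the text gives no concrete numbers for "high throughput" or "a large number of places".

```java
// The three registration strategies described in the text.
enum RegistryStrategy {
    CENTRAL_MOBA_REGISTRY,    // (1) original registry, announced in the MDS
    DISTRIBUTED_MDS_REGISTRY, // (2) registry contents stored as LDAP objects in the MDS
    ENHANCED_MOBA_REGISTRIES  // (3) several cooperating, enhanced MOBA registries
}

final class RegistryChooser {
    // Illustrative thresholds only; the text does not quantify them.
    static final int MANY_PLACES = 100;
    static final double HIGH_THROUGHPUT = 1000.0; // registry requests per second

    static RegistryStrategy choose(int places, double requestsPerSecond) {
        boolean manyPlaces = places >= MANY_PLACES;
        boolean highThroughput = requestsPerSecond >= HIGH_THROUGHPUT;
        if (highThroughput && !manyPlaces) {
            return RegistryStrategy.CENTRAL_MOBA_REGISTRY;
        }
        if (manyPlaces && !highThroughput) {
            return RegistryStrategy.DISTRIBUTED_MDS_REGISTRY;
        }
        // Everything between the two extremes uses the mixed strategy.
        return RegistryStrategy.ENHANCED_MOBA_REGISTRIES;
    }
}
```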

3.5 Secure Communication Service



The secure communication can be enabled while using the GlobusIO library and
sending messages from one Globus machine to another. This service allows one to send
any serializable object or simple message (e.g., thread migration, class file transfer, and
commands to the MOBA command interpreter) to other MOBA places executed under
Globus-enabled machines.

4 Conclusion 

We have designed and implemented a migration system for Java threads as a plug-in to
an existing JVM that supports asynchronous migration of execution context. As part
of this paper we discussed various issues, such as whether objects reachable from the
migrant should be moved, how the types of values in the stack can be identified, how
compatibility with JIT compilers can be achieved, and how system resources tied to
moving objects should be handled. As a result of this analysis, we are designing a JIT
compiler that improves our current prototype. It will support asynchronous and
heterogeneous migration with execution of native code. The initial step to such a system
is already achieved because we have already implemented a distributed object system
based on the JIT compiler to support selective migration. Although this is an
achievement by itself, we have enhanced our vision to include the emerging Grid infrastructure.
Based on the availability of mature services provided as part of the Grid infrastructure,
we have modified our design to include significant changes in the system architecture.
Additionally, we have identified services that can be used by other Grid application
developers. We feel that the integration of a thread migration system in a Grid-based
environment has helped us to shape future activities in the Grid community, as well as
to make improvements in the thread migration system.

Abstract

The research described in this paper is performed as
part of the Globus Project. It introduces a new Grid service
called InfoGram that combines the ability of serving as
information service and as execution service.
Previously, both services were envisioned and implemented
within the Globus Toolkit as different services with
different wire protocols. Our service demonstrates a
significant simplification of the architecture while treating job
submissions and information queries alike. The advantage
of our service is that it provides backwards compatibility
to existing Grid services while at the same time
providing forwards compatibility with the emerging Web services
world. Part of the work formulated within this effort is
already reused by the current Open Grid Services Architecture
prototype implementation.



1. Introduction 

The Grid approach is an important development in the
discipline of computer science and engineering [30]. It
is making rapid progress on several levels, including the
definition of terminology, the design of an architecture
and framework [13], the application in scientific problems
[5, 4], and the creation of physical instantiations of Grids on
a production level [10, 3, 2]. Grids provide an infrastructure
that allows for flexible, secure, coordinated resource sharing
among dynamic collections of individuals, resources, and
organizations.

Over the past few years, the Globus Project has
developed the Globus Toolkit [18] that provides a basic Grid
middleware toolkit, which includes elementary services to
address Grid management issues related to resource
management, security, information, and data management [30].
Two of the most important Grid services that are provided
by the Globus Toolkit are the information service and the
job execution service.

The information service returns information about the
capabilities and the state of the Grid infrastructure. The
Globus Toolkit provides such an information service called
Monitoring and Directory Service (MDS) [31, 7], formerly
known as Metacomputing Directory Service.

The job execution service controls the submission and
execution of jobs on remote machines. The Globus Toolkit
provides such a service under the name Grid Resource
Allocation and Management (GRAM) service [9]. GRAM
processes requests for execution, performs resource allocation,
and monitors and controls job execution. Furthermore, a
limited amount of information related to the capabilities and
availability regarding the job execution service for a Grid
resource can be exposed through an information provider to
the MDS. This information includes, for example, the name
of the queue, details about the mode of operation, and other
important features that may guide the process of job
submission by the user. Authentication to MDS and GRAM
is handled through the Grid Security Infrastructure (GSI).

The information and job execution service have so far
existed as separate services within the Globus Toolkit.
Considerable software engineering effort is necessary to
implement, maintain, and deploy these services while at the same
time supporting interoperability. We argue that this
complexity can be reduced significantly by alternative approaches to
both protocol design and implementation. To test this
hypothesis, we developed a prototype that promises a
significant simplification in all aspects previously mentioned. We
have termed our prototype InfoGram in order to
acknowledge its dual purpose.

Our research has the following objectives and goals: 

• Design of a simplified Grid service architecture to
provide a unified service for information, monitoring, and
job submission.

• Develop this service while providing backwards com- 
patibility by adhering to standard Grid protocols. 
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• Support multiple information return request formats 
such as LDIF and XML. 

• Improve the reliability of the job execution and, in a
second phase, replace the protocol used to perform
the job submission with SOAP.

• Provide an open framework that can be easily adapted
to interact with local schedulers and extract
information through custom designed information providers.

• Provide a framework that is based on GSI and its ap- 
plication within the Globus toolkit to map Global Grid 
User Identifiers to local account names. 

• Develop this service while providing forwards
compatibility to Web services.

• Build a groundwork for a Web services based imple- 
mentation of Globus services. 

The rest of this paper is structured as follows: First, we 
discuss the Globus Toolkit services GRAM and MDS in 
more detail. We outline how the operational integration of 
these services is achieved in production Grids. Next, we 
present the enhancements to the GRAM service that allow
constructing our InfoGram service. We demonstrate that
our service provides a significant architectural
simplification but at the same time provides enhancements currently
not available in the Globus Toolkit. Additionally, we show
that this new service can still be integrated into the existing
MDS concept. Finally, we outline how such a service can
be used as part of Grid applications.

2. Execution Service 

To contrast our differences to the Globus GRAM it is
necessary to revisit the architecture of the Globus GRAM
service. The basic structure of a GRAM service (version
1.1.x) and its interaction with clients relevant for our
discussion is depicted in Figure 1. A GRAM service provides the
basic functionality for secure and uniform access to remote
computational resources. The functionality of GRAM can
be explained as part of a typical three tier architecture.
Before we include our enhancements to this architecture
(Section 6), we explain the functionality of each tier in more
detail.

Client Tier. A client can submit a job to a remote
resource and can check on its status either through polling the
status of the job or through event notification to the client
through the GRAM Service. To allow identification of the
job, a job handle (often referred to as GlobusID) is returned
on job startup so that it can be used for later connection,
including from other remote clients with appropriate
authorization. For example, this job handle can be used to contact
the job and issue a cancellation.

Figure 1. The GRAM Architecture



Middle Tier. Internally, GRAM consists of a gatekeeper
and a job manager. The gatekeeper is responsible for
authentication with the client, performing a simple authorization
based on mapping the authentication information into a
local security context (e.g., a Unix login). After this initial
security check, it starts up a job manager that interacts
thereafter with the client based on the GRAM protocol (from
now on referred to as GRAMP). Each job submitted by a
client to the same GRAM will start its own job manager.

Backend Tier. Once the job manager is activated, it
handles the communication between the client and the backend
system on which the job is executed. The backend tier is
easily portable to various scheduling systems. The Globus
Toolkit services provide scheduling interfaces such as PBS,
LSF, Condor, and Unix process fork [21, 18].

The GRAM service can be accessed with the help of a
C or a Java application interface. This interface includes
the ability to specify a job runnable on a particular resource
with the help of a uniform Resource Specification Language
(RSL). The RSL makes it possible to quickly and uniformly
specify jobs to be run as part of a Globus enabled Grid.
Simple tools are available to access the basic functionality
also from the command line.
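
To make the role of RSL concrete, the following Java sketch assembles a simple RSL job description as a conjunction of (attribute=value) relations. The RslBuilder class is a hypothetical helper written for this illustration and is not part of the Globus C or Java interfaces; it also ignores the quoting and escaping rules that a complete RSL implementation has to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that renders a flat RSL conjunction.
final class RslBuilder {
    // LinkedHashMap keeps the relations in insertion order.
    private final Map<String, String> relations = new LinkedHashMap<>();

    RslBuilder set(String attribute, String value) {
        relations.put(attribute, value);
        return this;
    }

    // Produces a string of the form &(attribute=value)(attribute=value)...
    String build() {
        StringBuilder rsl = new StringBuilder("&");
        for (Map.Entry<String, String> r : relations.entrySet()) {
            rsl.append('(').append(r.getKey()).append('=')
               .append(r.getValue()).append(')');
        }
        return rsl.toString();
    }
}
```

For example, new RslBuilder().set("executable", "/bin/date").set("arguments", "-u").build() yields the string &(executable=/bin/date)(arguments=-u), which describes a job that runs /bin/date with the argument -u.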

Although we have in the past demonstrated mechanisms
and protocols for application states and notification, such
advanced functionality [32] has not yet been included in the
Globus Toolkit.

3. Information Service

The basic structure of a Grid information service is de- 
fined in [31] and was further refined in [8]. A Grid informa- 
tion service requires: 

• access to static and dynamic information regarding 
system components and services. 
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• a framework that fits well with the heterogeneous 
and dynamic nature of Grids, including decentralized 
maintenance and operation,

• scalability and performance, 

• integration of a variety of information providers. 

The Globus Project has developed a basic information
service that addresses these requirements. The Globus Grid
information service, MDS, contains two fundamental
entities: distributed information providers and information
aggregates. An information provider is a service that provides
a subset of useful information about resources exploited by
Grid users or Grid services. Examples of information that
may be accessed through such an information provider are
CPU, operating system, network, and file system
information.

Additionally, the aggregate service is used to integrate
a set of information providers that may be part of a
virtual organization [14]. To increase the scalability of a
distributed information service, the MDS provides an
information caching function that allows viewing and querying the
information about a resource from a cache. Furthermore,
the newest implementation of a Grid information service
that implements the framework proposed by the MDS
concept integrates GSI to perform authentication.

The information contained within MDS can be queried
and used to enable more sophisticated Grid services. More
details about the protocols, the services, and the newest
nomenclature can be found in [17, 7].

The research within this paper concentrates on the
information provider itself, as we can create information
aggregates through reuse of information providers to improve
scalability. Furthermore, we argue that it is worthwhile to
provide Google-like services, as have been used in many
previous Grid-like projects [28, 29, 11].

4. Using GRAM and MDS in Production Grids 

Figure 2 shows how the GRAM and the MDS services
may be used in a simple production Grid. Our Grid
consists of one virtual organization that maintains a number
of compute resources. Each compute resource has the
Globus GRAM and the Globus Resource Information
Service (GRIS) that returns information related to the local
resource installed.

In order for a client to perform a job execution and an
information query, two different mechanisms for
contacting these services must be used. Not only do the services
operate through different ports, but they also use different
protocols, which makes sharing code for interpreting
return values more complex. The installation of both
services required additional sophistication. We feel that the
use of different technologies is in contrast with the desire to
provide a minimal set of protocols and services for Grids as
promoted by the Global Grid Forum and the Globus Project
[14]. If we think abstractly about job execution and an
information service, we must recognize that they are based on
the same principle: a query is formulated and submitted to a
server, followed by a stream of information that returns the
result based on the query.

Figure 2. A sample interaction between a
client, GRAM, and MDS

5. Addressing Requirements for the InfoGram Service

We have designed our InfoGram service according to a
set of requirements determined by general software
engineering practices which include factors such as quality,
performance, reliability, security, and portability. All of these
factors must be addressed within the realm of Grids.
Nevertheless, we concentrate our efforts on the following issues.

5.1. Performance

An information and job execution service must perform
its tasks quickly. The elapsed time between job request
and job submission must be as short as possible. At the
same time, information within the system must be
accessible quickly. For example, it may be inefficient to execute
the program creating the data, or to relay a query to an
external information service, each time a user requests data. A
simple example will illustrate our point. Assume we have
a large number of clients that need to know the CPU load
of a remote compute resource. It would be wasteful to
execute the command requesting the load every single time.
Instead, it can be more efficient to cache this value within
the information service, and only refresh this cache value
periodically. In order to prevent staleness of information
we attach a time to live (TTL) value with the information.
This value will tell us when a refresh of the information in
the cache is necessary.
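
The caching policy just described can be sketched in a few lines of Java. This is an illustration under stated assumptions rather than the InfoGram code: the class name TtlCachedValue, the Supplier that stands in for the command producing the value, and the injectable clock (used so that the behavior can be checked without waiting) are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical TTL cache for a single value such as the CPU load.
final class TtlCachedValue<T> {
    private final Supplier<T> producer;  // stands in for running the command
    private final long ttlMillis;        // time to live of a cached value
    private final LongSupplier clock;    // current time in milliseconds
    private T value;
    private long fetchedAt;
    private boolean loaded;

    TtlCachedValue(Supplier<T> producer, long ttlMillis, LongSupplier clock) {
        this.producer = producer;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value and refreshes it only after the TTL expired.
    // A TTL of 0 therefore runs the producer on every request.
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - fetchedAt >= ttlMillis) {
            value = producer.get();
            fetchedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

With a TTL of 60 milliseconds, any number of clients asking within that window cause the producer to run only once; in production use the clock would simply be System::currentTimeMillis.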

5.2. Quality of Information

Information within Grids may become quickly
inaccurate. We often observe two cases. Case One: In the
simplest form the information can be described as a binary system
where the information is either accurate or inaccurate. Case
Two: In many other situations the information may degrade
over time in a discrete fashion. Thus, it is not
unreasonable to attach a degradation function with the actual value
of information that reflects the degree of degradation. This
function may be influenced by time, system state, or
prediction functions to derive a quality of information assessment.
Often it is possible to attempt to derive such degradation
information through observation or through mathematical
models while performing self-correction based on
observation data. This is not unlike sophisticated data assimilation
as used in scientific simulation that corrects its values based
on a comparison between observations and prediction
models. The quality of information becomes important in case
more sophisticated resource management strategies are
developed. If I obtain an attribute such as "mean CPU load"
from a Grid information service, it is beneficial to have the
quality of the information attached. Knowing the standard
deviation or knowing that the accuracy of the value is valid
over the last hour or the last day is an important factor to
create more sophisticated Grid services.
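
The two cases above can be expressed as degradation functions that map the age of a value to a quality between 0 and 1. The sketch below is only an illustration: the class QualifiedValue, the quality scale from 0 to 1, and the two factory methods are assumptions made here and are not taken from the InfoGram implementation.

```java
import java.util.function.LongToDoubleFunction;

// Hypothetical value that carries a degradation function.
final class QualifiedValue {
    final double value;
    final long observedAtMillis;
    // Maps the age of the value in milliseconds to a quality estimate.
    private final LongToDoubleFunction degradation;

    QualifiedValue(double value, long observedAtMillis, LongToDoubleFunction degradation) {
        this.value = value;
        this.observedAtMillis = observedAtMillis;
        this.degradation = degradation;
    }

    // Quality of information at the given time, clamped to [0, 1].
    double qualityAt(long nowMillis) {
        double q = degradation.applyAsDouble(nowMillis - observedAtMillis);
        return Math.max(0.0, Math.min(1.0, q));
    }

    // Case One: accurate until maxAgeMillis, inaccurate afterwards.
    static LongToDoubleFunction binary(long maxAgeMillis) {
        return age -> age < maxAgeMillis ? 1.0 : 0.0;
    }

    // Case Two: quality drops by dropPerStep after every full stepMillis
    // (integer division makes the degradation discrete).
    static LongToDoubleFunction stepwise(long stepMillis, double dropPerStep) {
        return age -> 1.0 - dropPerStep * (age / stepMillis);
    }
}
```

A client that receives several candidate values for the same attribute could then prefer the one whose qualityAt result is highest at the time of the request.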

5.3. Security

Access to services such as the information and job
execution needs to be performed securely. The Grid
Security Infrastructure (GSI) provides us with an elementary
framework for authentication. Nevertheless, authentication
is only one problem to be addressed within Grids. In our
framework, we strive to include authorization that allows us
to specify contracts such as "allow access to this resource
from 3 to 4 pm to user X." Additions to GSI and the use
of more sophisticated authentication frameworks [27] may
provide them in the future.
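
A contract of the kind quoted above ("allow access to this resource from 3 to 4 pm to user X") can be modeled with a few lines of Java. The AccessContract class is a hypothetical illustration of such a time-window rule; it is not a GSI interface, and it assumes that the requester has already been authenticated.

```java
import java.time.LocalTime;

// Hypothetical time-window authorization contract for one user.
final class AccessContract {
    private final String user;
    private final LocalTime from;   // start of the window, inclusive
    private final LocalTime until;  // end of the window, exclusive

    AccessContract(String user, LocalTime from, LocalTime until) {
        this.user = user;
        this.from = from;
        this.until = until;
    }

    // True only for the named user and only inside the time window.
    boolean permits(String requester, LocalTime at) {
        return user.equals(requester) && !at.isBefore(from) && at.isBefore(until);
    }
}
```

The contract from the text would be created as new AccessContract("X", LocalTime.of(15, 0), LocalTime.of(16, 0)).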

5.4. Portability

Protocol compatibility of these services with the Globus
Toolkit is preserved by using the GRAM and Grid
Security Infrastructure (GSI) protocols. Future activities will
include the integration of commodity protocols (such as
SOAP) to provide interoperability with Web services and
greater acceptance outside of the Grid community [15, 36,
38, 37].



5.5. Flexible and Extensible Information Model 

One of the issues we face with information providers is
the lack of a standard that is uniformly adhered to by the
community. We observe the use of CIM, MIB, MDS, or
nonstandard or unorthodox display of information in tables.
Although we believe that the creation of a consistent
information model is important, we focus within this paper
on the mechanism of delivering that information to the user.
The reasoning for this strategy is that our InfoGram service
provides the necessary mechanisms for delivering the
information according to the information model used within the
information provider. Our positive experience with the use
of XML schemas as basis for the next generation of
information services makes us believe that it provides a viable
alternative to the currently used LDAP schemas.
Compatibility can be maintained while developing strict guidelines
for the object definition by the Global Grid Forum.

Nevertheless, we believe that an additional requirement
must be fulfilled to enhance the use and acceptance of Grids.
We believe that the execution of untrusted applications in
trusted environments is important to enable the use of Grids.
We hope that through this feature the user community will
increase dramatically based on software that is developed
as part of our activities.

Providing such software will enable the creation of
infrastructures that will promote Grids in new communities,
which previously did not have the luxury to access high end
resources. Besides making access to supercomputer
centers for outside users much more feasible, we foresee that
resource providers may be more willing to contribute
resources otherwise not part of the national-scale Grids.

6. InfoGRAM Architecture 

As pointed out earlier, we modified the architecture of 
the GRAM server and enhanced it substantially in order to 
fulfill the requirements described earlier. We added to the
original architecture additional components, as shown by 
the shaded components in Figure 3, and describe these en- 
hancements based on the functionality they are providing. 
These functionalities are centered on client interaction, log- 
ging and check pointing, job execution, information man- 
agement, and configuration. 

Logging and check pointing is enabled through a logging
service. This service can receive logging events from
several components. The log can either be stored in the middle
tier or on the backend tier. In either case the log can be
used to restart our InfoGRAM service in case it needs to be
restarted (e.g., the machine was shut down). In the same way
it would be possible to use the logging service for check
pointing of applications. Presently, we only record
minimal information such as the command used and arguments
executed. We intend to use this logging service to provide
simple Grid accounting.

Figure 3. The InfoGram architecture that
combines a GRAM service with an Information
Service using only one protocol between
client and server.

6.1. Job Execution

The execution of jobs is made more robust while
integrating a logging and fault tolerance mechanism that allows
restart of a job upon failure.

6.2. Information Service

As mentioned previously, the InfoGram Service contains
several novel features in regards to the information service
part it provides. Those features are: an information service
that is integrated with GRAM providing backwards
compatibility to MDS, and support of information caching and
the retrieval of elementary information associated with the
local resource. Additionally, we are integrating in our
service the feature of information degradation and
self-adaptation of information updates as discussed earlier.

Our information service is architected with two
components (see Figure 3): the system monitor and the system
information service. The monitor service controls initializing
and caching the results requested by the clients. The system
information service returns relevant information about the
system resources, through either (a) calls to a system
command via the Java runtime exec, (b) a query to a function
exposing Java runtime information such as load, memory,
or disk space, or (c) a read function from a file that is used
by an information provider. A good example for an
information provider is the Linux proc file system. As we have
chosen an object oriented framework for our
implementation, the integration of new information providers can be
performed through the implementation of interfaces. This
will allow us to be able to provide a flexible and extensible
information services framework.

interface SystemInformation {
    String getKeyword();
    void setKeyword();
    Object queryState();    // non-blocking read of the cached state
    Object updateState();   // blocking refresh of the state
    Time ttl();             // time to live of the cached state
    int validity();

    public void setDelay(Time time);
    String setFormat(Format format);
    Time getAverageUpdateTime();
}



This interface allows us to generate new information
providers in a fashion very similar to the current MDS
model and its implementation. The method queryState
is non-blocking and returns valid information only when
the information has been queried previously and the time
to live (ttl) value has not expired. Otherwise, it throws
an exception. Upon invocation of the updateState
method, a blocking method is called that returns the
appropriate information while also updating the time to live
value. If multiple updateState methods are invoked,
monitors are used to perform only one such update at a
time. Additionally, we provide a delay that controls how
many milliseconds must pass between consecutive calls of
updateState before the actual information is obtained
through a runtime exec call. This is useful in cases where
users ask for information more frequently than it can be
produced by the system.
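
The semantics of queryState and updateState described above can be sketched as follows. This is a simplified illustration, not the InfoGram source: the class name CachedProvider, the Supplier that stands in for the runtime exec call, the use of IllegalStateException, and the injectable clock are assumptions made for the example, and the TTL and delay are plain millisecond values rather than the Time type of the interface.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical provider that follows the queryState/updateState contract.
final class CachedProvider {
    // Immutable pair of a state and the time at which it was obtained.
    private static final class Snapshot {
        final Object state;
        final long updatedAt;
        Snapshot(Object state, long updatedAt) {
            this.state = state;
            this.updatedAt = updatedAt;
        }
    }

    private final Supplier<Object> fetch; // stands in for the exec call
    private final long ttlMillis;
    private final long delayMillis;
    private final LongSupplier clock;
    private volatile Snapshot snapshot;   // null until the first update

    CachedProvider(Supplier<Object> fetch, long ttlMillis, long delayMillis, LongSupplier clock) {
        this.fetch = fetch;
        this.ttlMillis = ttlMillis;
        this.delayMillis = delayMillis;
        this.clock = clock;
    }

    // Never fetches and takes no lock: returns the cached state only if
    // one exists and its time to live has not expired.
    Object queryState() {
        Snapshot s = snapshot;
        if (s == null || clock.getAsLong() - s.updatedAt >= ttlMillis) {
            throw new IllegalStateException("no valid cached state");
        }
        return s.state;
    }

    // Blocking: the monitor admits one update at a time. A new fetch is
    // made only if at least delayMillis passed since the previous one.
    synchronized Object updateState() {
        long now = clock.getAsLong();
        Snapshot s = snapshot;
        if (s == null || now - s.updatedAt >= delayMillis) {
            s = new Snapshot(fetch.get(), now);
            snapshot = s;
        }
        return s.state;
    }
}
```

One design choice in this sketch is that a call to updateState inside the delay window returns the previous state without extending its time to live; other readings of the text are possible.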

6.3. Configuration

We provided a configuration component that allows us
to set up the InfoGram service with ease. This component
includes the possibility to configure the system monitor
service with customized information providers similar to the
MDS. This configuration file contains the following
parameters:

TTL: the lifetime in milliseconds of each datum generated by
the specific keyword; 0 specifies execution of the
keyword every time it is requested.

Keyword: the keyword that will be used in an RSL string
to identify the mapping to a real program or a Java
application to be executed in the background.

Executable Path: the full, machine-dependent executable
path and name, with arguments, that is associated with
the keyword.
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Table 1. The InfoGram configuration file pro- 
vides a mapping between keywords and in- 
formation providers 

TTL    Keyword   Command

60     Date      date -u
80     Memory    /sbin/sysinfo.exe -mem
100    CPU       /sbin/sysinfo.exe -cpu
0      CPULoad   /usr/local/bin/cpuload.exe
1000   list      /bin/ls /home/gregor

We provide an example for the information represented
within such a configuration file in Table 1. As the keyword
identifies the information obtained with the program, we will
refer to it from now on as a key information provider. Each
attribute within a key information provider is augmented
with a namespace conforming to the keyword. Thus the
attribute "total" in the "Memory" information provider would
be referred to as Memory:total.
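
A configuration entry of this kind, together with the keyword namespace, can be illustrated with the following Java sketch. Several details are assumptions made for the example: the on-disk format (one entry per line, whitespace-separated, with the command taking the rest of the line), the colon used as namespace separator, and the class name ProviderEntry are not specified by the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory form of one configuration entry.
final class ProviderEntry {
    final long ttlMillis;
    final String keyword;
    final String command;

    ProviderEntry(long ttlMillis, String keyword, String command) {
        this.ttlMillis = ttlMillis;
        this.keyword = keyword;
        this.command = command;
    }

    // Parses a line of the assumed form: TTL Keyword Command [arguments...]
    static ProviderEntry parse(String line) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected: TTL Keyword Command");
        }
        return new ProviderEntry(Long.parseLong(parts[0]), parts[1], parts[2]);
    }

    // Prefixes every attribute name with the keyword, e.g. total -> Memory:total.
    Map<String, String> qualify(Map<String, String> attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            result.put(keyword + ":" + a.getKey(), a.getValue());
        }
        return result;
    }
}
```

Parsing the Memory entry and qualifying an attribute named total would, under these assumptions, produce the key Memory:total.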

6.4. Caching and Information Degradation 

The caching functionality is similar to that of the MDS
2.0. Nevertheless, queries to the information service are
simple all-or-nothing queries based on the keywords used
within the configuration files. That means all attributes that
are obtained through the command associated with a
keyword will be returned. Based on this simple model, the
caching of information is easily possible. Additionally, we
have the option to augment each attribute that is returned
within a key information provider with a degradation
function or a quality of information value. Selecting similar
information attributes can then be performed on the quality of
the information provided.

6.5. Service Reflection

Each information service can be queried, and a client may
inspect the schema that is returned by the information
service. This allows developers to design programs that
adapt to the information schema actually in use. We
believe that reflection and introspection of the capabilities
of an execution and information service will become
increasingly important with the increased number of available
Grid services.

6.6. Client Interaction through xRSL 

Although we developed our first prototype architecture
as a Web service, we believed at the time that it would be
too big a departure from the existing Globus Toolkit.
We thought that most important for the acceptance of our
information service is the recognition that the Globus Toolkit
has reached ubiquity within the community, and that the
Globus protocols should be reused.

Thus, as we wanted to maintain a degree of backwards
compatibility, we decided not to choose a pure Web
services-based implementation that uses only WSDL [36],
XML-schema [39], and XML query. We felt that such an effort
could be performed in a second step (as it is now performed
as part of the Open Grid Service Architecture [1]). Instead
of using URIs to formulate job submission and information
queries, we argued that users of the Globus Toolkit are
sufficiently familiar with RSL. Therefore, it was most natural
to extend RSL with the more advanced features we have
introduced so far. We added the following tags to the Globus
RSL: schema, info, filter, response, performance, quality,
format. We call the result xRSL.

Info. The info tag is followed by the key as specified
in the configuration file, defining a mapping between
the keyword and the command to be executed.
If it is set to (info=all), all commands are executed.
Commands can be selectively queried by
concatenating multiple info tag queries, for example,
(info=Memory)(info=CPU). A special value for the
info tag is (info=schema). This returns a hierarchical
schema that contains all objects associated with the
keywords and lists properties of their attributes.
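
As a rough illustration of how concatenated info tags could be interpreted, the Python sketch below extracts simple (attribute=value) relations from a query string and selects the keywords to execute. The regular expression, the function names, and the restriction to the plain attribute=value form are assumptions made for this example; the special (info=schema) value and the full RSL grammar are not handled.

```python
import re
from typing import List, Tuple

# One parenthesized relation such as (info=Memory). Only the simple
# attribute=value form is recognized in this sketch.
_RELATION = re.compile(r"\(\s*([A-Za-z_]\w*)\s*=\s*([^()]*?)\s*\)")


def parse_relations(xrsl: str) -> List[Tuple[str, str]]:
    """Return (attribute, value) pairs in the order they appear."""
    return [(name.lower(), value) for name, value in _RELATION.findall(xrsl)]


def select_keywords(xrsl: str, configured: List[str]) -> List[str]:
    """Pick the configured keywords requested through info tags."""
    requested = [value for name, value in parse_relations(xrsl) if name == "info"]
    if "all" in requested:
        return list(configured)           # (info=all) selects every keyword
    return [k for k in requested if k in configured]
```

For example, select_keywords("(info=Memory)(info=CPU)", ["Date", "Memory", "CPU"]) selects the Memory and CPU providers and leaves Date untouched.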

Response. The response tag defines the behavior
with respect to the information caching. Thus, with
(response=immediate) the commands associated
with the info tag are executed immediately regardless of
the time to live. This will also update the cached values.
Using (response=cached) will return the information
from the cache if it is valid; otherwise it will update
the cache first. Using (response=last) will return the
value stored last in the cache without updating it.
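
The three response modes can be illustrated with a small Python sketch of a single cached provider. The class name, the injectable clock, and the choice to return None for (response=last) on an empty cache are assumptions of this example, not details taken from the InfoGram implementation.

```python
import time
from typing import Callable, Optional, Tuple


class ProviderCache:
    """Cache for one keyword, with the immediate/cached/last response modes."""

    def __init__(self, ttl_ms: int, produce: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_ms = ttl_ms      # 0 means: execute on every request
        self.produce = produce    # runs the command behind the keyword
        self.clock = clock        # returns seconds; injectable for testing
        self._entry: Optional[Tuple[float, str]] = None  # (timestamp, value)

    def _refresh(self) -> str:
        value = self.produce()
        self._entry = (self.clock(), value)
        return value

    def _valid(self) -> bool:
        if self._entry is None or self.ttl_ms == 0:
            return False
        age_ms = (self.clock() - self._entry[0]) * 1000.0
        return age_ms < self.ttl_ms

    def query(self, response: str = "cached") -> Optional[str]:
        if response == "immediate":   # always execute and update the cache
            return self._refresh()
        if response == "cached":      # use the cache only while it is valid
            return self._entry[1] if self._valid() else self._refresh()
        if response == "last":        # never execute, never update
            return self._entry[1] if self._entry else None
        raise ValueError(f"unknown response mode: {response}")
```

Passing the clock as a parameter keeps the time-to-live logic deterministic under test; in normal use the default monotonic clock is sufficient.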

Quality. The quality threshold tag provides the
possibility to specify a percentage that gives additional
guidance on whether a cached value should be returned or
whether the information needs to be refreshed before return.
Currently, we define the following semantics: if the value of
the degradation function of any of the returned attributes is
below that threshold, this attribute is regenerated by the
associated command.
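
A minimal Python sketch of this threshold rule follows. The linear degradation function is purely an assumption chosen for illustration (the text does not prescribe a particular function), and the function names are hypothetical.

```python
def linear_quality(age_ms: float, ttl_ms: float) -> float:
    """Assumed degradation: 100% when fresh, falling linearly to 0% at the TTL."""
    if ttl_ms <= 0:
        return 0.0
    return max(0.0, 100.0 * (1.0 - age_ms / ttl_ms))


def needs_refresh(ages_ms: dict, ttl_ms: float, threshold_pct: float) -> list:
    """Return the attributes whose quality has dropped below the threshold."""
    return sorted(name for name, age in ages_ms.items()
                  if linear_quality(age, ttl_ms) < threshold_pct)
```

With a TTL of 100 ms and a threshold of 50 percent, an attribute that is 10 ms old is served from the cache, while one that is 90 ms old would be regenerated.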

Performance. The performance tag returns the mean
number of seconds, and the standard deviation, of the time it
takes to obtain a particular information value. The
performance of a command and its attributed values is measured
and catalogued during runtime.
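
One way to catalogue such measurements at runtime without storing every sample is a running mean and standard deviation. The Python sketch below uses Welford's method; the class and function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption of this example.

```python
import math
import time


class TimingStats:
    """Running mean and sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the current mean

    def add(self, seconds: float) -> None:
        self.n += 1
        delta = seconds - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (seconds - self.mean)

    @property
    def stddev(self) -> float:
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def timed(stats: TimingStats, command):
    """Run a command and record how long it took, in seconds."""
    start = time.perf_counter()
    result = command()
    stats.add(time.perf_counter() - start)
    return result
```

A service could keep one TimingStats object per keyword and report its mean and stddev fields when the performance tag is present in a query.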

Format. The format tag defines the format in which the
information is returned. The supported formats are LDIF
and XML. Nevertheless, it is straightforward to support
other formats such as DSML.

Extensions. We are planning to extend our existing
timeout tag with an additional action tag that takes effect
upon reaching this timeout. For example, the RSL
(executable=command)(timeout=1000)(action=cancel)
would cancel the command specified through the RSL,
while (action=exception) would throw an exception if
the command has not completed its execution, but the
execution of the command itself would continue.

Advantages. The advantages of this information service
stem from the simplification of the architecture bound to the
delivery of an integrated job submission and information
service. Querying the information is handled by clients much
like the execution of jobs. Moreover, this information service
can easily be integrated into the Globus MDS information
service architecture.

In summary, we have explored changes to MDS at the
protocol and the implementation level. At the protocol level,
we have replaced an LDAP search query with a "query" cast
as a simple job submission through RSL. This new query
mechanism is based on [...]. At the implementation level, we
have replaced the modular, configurable MDS information
provider architecture with a less complex, even more
modular, configurable architecture that we believe fulfills, in a
straightforward fashion, the Grid designer's quest for an
easy to use and maintain information service. As part of this
implementation effort we have also explored more advanced
features for dealing with caching of the information based
on quality augmentations to the data itself. The result of our
simplified architecture is presented in Figure 4 and contrasts
with our earlier Figure 2. We believe that although the number
of components within our InfoGram service increased,
the overall complexity of the combined service is lower than
that of the currently provided solution.




Figure 4. The new InfoGram service reduces 
the number of protocols and components in 
a Grid 



7. Implementation 

Although our services can be implemented in any other
language, we have chosen to prototype them in Java. It is a
straightforward engineering exercise to implement them in
C. The Java platform enhances the functionality of our
service based on the use of additional features that are
otherwise not available in C. Thus, we were able to achieve:

• Delivery of a pure Java Information and GRAM
service providing cross-platform portability, which
includes the Windows operating system.

• Delivery of a Web-enabled installation service that can
deploy the InfoGram service with low overhead on
installation time and administrative burden.

• Execution of untrusted applications in trusted
environments on remote machines as part of the Java Virtual
Machine model.

To support the development of the previously outlined
service, we have made significant enhancements to
the Java CoG Kit that is maintained as part of the Globus
Project. These enhancements focus on job submission,
deployment, logging, security, and the information
service. Whenever possible, we use standard Java packages
to reduce the size of the codebase that must be maintained
by us. This includes logging [20, 25] and security [19, 26].

In a first step, we have implemented a pure Java
implementation of a Globus GRAM [16, 9] service that
provides much the same functionality as its C-based
counterpart. In order to support interoperability and
compatibility, we based the design directly on the architecture of the
C GRAM service. It contains a gatekeeper, a job manager,
and a local job execution process. We name this service
J-GRAM.

Job Submission. The job execution service within J-GRAM
is protocol-compatible with the "C-GRAM" distributed with the
Globus Toolkit. At present, we investigate the
implementation of major GRAM functionality, such as the support
for gridmaps, which map user certificates to local user IDs,
as well as the possibility to interface easily to schedulers.
We learned from this prototype that it is possible to provide
a service in Java that mimics the behavior of C-GRAM.

Besides the invocation of executables from precompiled
native code, our J-GRAM service enhances the normal
Globus GRAM service by being able to execute pure Java
code submitted as Java jar files. To enable the execution
of jar files as part of the J-GRAM service, a variety
of changes were necessary. We extended the functionality
of the job manager to start up the code embedded in
a jar file that was submitted through an RSL call such as
(executable=myJavaApplication.jar)

In order to run Java applications, one method is to
execute the code in the same JVM in which the rest of the
components are running. An alternative is to execute
the job in a separate JVM to increase security [19, 24]. We
provide the ability to configure the job manager to run in either
of these modes. The Grid administrator must decide which
mode should be used. The execution of system commands
is performed through the Runtime.exec() call. It is possible
to redirect I/O to and from the client. The functionality is
equivalent to that of the C GRAM service, with the
exception that DUROC is not supported. As the Globus Project
will replace it in the near future, we have decided to
delegate fully to a C Globus GRAM in order to provide
this functionality. Therefore, it is still possible to start up
MPICH-G2 jobs [22].

Deployment. We have demonstrated this service at
SC2001 and featured the ease of installation of such a
service while using the Java framework deployment methods
known as Web Start. Using this advanced deployment
protocol, we are also able to maintain the upgradeability with
more ease and to provide future solutions for automatically
upgrading such services in production Grids. This feature
is naturally supported while choosing Java as an
implementation and deployment platform. Such sophisticated
approaches require much more effort in traditional operating
systems.

Logging. We are in the process of refining a logging
mechanism for the execution of jobs that assists in the
fault-recovery abilities of GRAM, as well as the possibility of
logging authenticated information queries to guide their use as
part of intelligent scheduling services.

Secure Sandboxing. In traditional programming
languages such as C, C++, and FORTRAN, it is difficult to
execute untrusted applications in a trusted environment
similar to the one the Grid provides. With a JVM, however, we
can establish a trust relation between an untrusted client
application to be executed in a trusted environment.
Additionally, we are able to package a gatekeeper with
non-root access rights in a jar file that can be easily installed
in one environment. J-GRAM can be configured in various
ways. We can either execute each job in the already running
JVM or start up a number of external JVMs to execute a jar
file in an even more restrictive environment.

Portability. Other advantages that are based on the use
of Java are the immediate availability of an information
service on the Windows operating system. Other benefits are
introduced by providing authorization mechanisms as part
of this service, which can be supported by the Java
platform.

InfoGram. In a second step we have prototyped much
of the functionality described within this paper to enable
the InfoGram service. We have gained good experience
with returning the results of information queries in LDIF and XML.

8. Application 

The J-GRAM service has already been used in
several projects, one of which is the emerging OGSA
framework [12] that has been developed after our investigations.

We have tested our InfoGram prototype on an
application that we have termed a sporadic Grid. Such a Grid is
created just for a short period of time during sophisticated
experiments at synchrotrons or photon sources [35, 34]. To
implement such a Grid we need a simple architecture that
contains a set of advanced Grid services that are useful for
supporting the creation and maintenance of sporadic Grids.
Our InfoGram service provides such a service. As we are
able to distribute it as a pure Java application, it will be easy
to install it on a number of machines or access it through
Web browsers.

We will extend our efforts to support computationally
mediated sciences [40]. In this technique, a focused
electron probe is sequentially scanned across a two-dimensional
field of view of a thin specimen, and at each point on the
specimen a two-dimensional electron diffraction pattern is
acquired and stored. The analysis of the spatial variation in the
electron diffraction pattern allows a researcher to study the
subtle changes resulting from microstructural differences,
such as ferro- and electromagnetic domain formation and
motion, at unprecedented spatial scales. We will provide the
computational Grid infrastructure for these classes of
experiments.

9. Related Work 

Parallel to the research described in this paper,
modifications to GRAM 1.0 were performed by colleagues within the
Globus Project together with the Condor team at the
University of Wisconsin. This modified version of GRAM is
available as part of the Globus 2.0 release. We are
protocol-compatible with that version. Most recently, the Globus
Project started, together with IBM, on the Open Grid
Services Architecture. Our work was performed before OGSA.
Lessons learned from our activities should have influence
on the OGSA work. The current OGSA prototype
implementation uses the J-GRAM service, as well as the GSI
security provided through the Java CoG Kit [33].

10. Status and Future Plans 

The work performed within this research activity
explored new concepts that we expect to be considered in
future Globus Toolkit developments. Future research
activities will include exploration of conceptual issues identified
within this paper, as well as their implementation as part
of prototype and Globus Toolkit developments. On the
conceptual level, we will investigate explicit guidelines for
system designers to choose the right configuration for
setting up the InfoGram service with the appropriate
parameters and configuration files. We will perform further
simplifications of the J-GRAM architecture by using only
one port to communicate between job managers and clients.
For compatibility reasons, we have not yet been able to
perform this change. Improved fault tolerance will allow for
automatic restart capabilities enabled through
checkpointing. We are improving our code and hope to integrate it in
either the Globus Toolkit or the OGSA framework. Several
features, such as the use of the performance tag and the
information degradation, are being integrated at the moment. We
are also experimenting with integration of our framework
in Web services and JXTA [23].

11. Discussion and Conclusion 



We feel that we have contributed to several areas within
Grid computing. First, we identified that it is possible to
design an information system and a job submission service
that simplifies the architecture of the services provided by
the Globus Toolkit. Through the extension of the RSL it will
be easy for current Globus Toolkit users to adapt their code
to use this information query. Second, we provide the
possibility of being protocol-compatible with the Globus Toolkit,
while being able to integrate our information provider in the
existing MDS. Therefore, we provide the option to move
to a different information provider while enabling a
gradual transition. New information providers could be
integrated easily in this information service framework. Third,
we have already integrated in the current Java CoG Kit our
J-GRAM service that allows executing untrusted applications
in trusted environments. This service is naturally able to run
on Windows platforms and can be used to support sporadic
Grids as defined in the paper. Fourth, we set the stage for
multi-protocol support for Grid information services that
may export their data in LDIF or XML-schema.

We presented suggestions for enhancing the Globus
Toolkit and believe that future development on Globus
GRAM can benefit from our research on sporadic Grids.
We believe that the Open Grid Services Architecture will
benefit from this work performed over the last year. In
particular, the simplified InfoGram service can be used as an
elementary replacement for a lightweight job execution and
information service. It is straightforward to cast the
InfoGram in WSDL. Considerable software engineering effort
is necessary to implement, maintain, and deploy these
services while at the same time supporting interoperability.
About Grid computing




What would it mean if you could: 

• Analyze the value of an investment portfolio in minutes rather 
than hours? 

• Unite research teams with others around the world to take 
advantage of the most up-to-date learnings? 

• Significantly accelerate the drug discovery process? 

• Scale your business to meet cyclical demand? 

• Cut the design time of your products in half while reducing the 
instances of defects? 

Government labs and scientific organizations have been using grid
technologies for several years, solving some of the most complex
and important problems facing mankind. Now grid computing is
becoming a critical component of day-to-day business. Today's
challenging business climate requires continuous innovation to
differentiate products and services. Businesses must adjust
dynamically and efficiently to marketplace shifts and customer
demands.

IBM's response to these customer needs is what e-business on
demand is all about. There's a profound shift afoot in how computing
is used, even in basic assumptions about how it's accessed and
paid for. Grid computing can bring tremendous productivity and
efficiency to organizations facing the challenges of an on demand
world.

IBM has practical information on grid computing

Find out what a grid is.
What is grid computing?

Learn how IBM uses grid.
IBM and grid

Learn about the significant productivity and efficiency gains that
grid can offer businesses today.
Grid benefits

Get answers to frequently asked questions for businesses just
considering grid computing, as well as those taking the next steps in
unleashing grid power.
Frequently asked questions
Abstract. This document gives an overview of a Grid testbed
architecture proposal for the NorduGrid project [1]. The aim of the project is
to establish an inter-Nordic¹ testbed facility for implementation of wide
area computing and data handling. The architecture is supposed to
define a Grid system suitable for solving data intensive problems at the
Large Hadron Collider at CERN [2]. We present the various architecture
components needed for such a system. After that we go on to give a
description of the dynamics by showing the task flow.



1 Introduction 

This document assumes basic knowledge of the computing Grid concept, which
is a paradigm for modern distributed computing and data handling. For a
general introduction to Grid computing the reader is referred to e.g. [3]. The
most common starting point for constructing a computing Grid is the Globus
Toolkit² [4]. This toolkit provides a Grid API and development libraries as well
as basic Grid service implementations.

The NorduGrid project is a common effort by the Nordic countries to create
a Grid infrastructure, making use of the available middleware. Through the
European DataGrid project (EDG) [5] the NorduGrid project has had extensive
experience with the Globus Toolkit and with deploying and using a Grid testbed.
During this we have found some shortcomings in the Globus Toolkit and some
problems with the EDG Testbed architecture that we would like to address on
a Grid testbed in the Nordic countries. In this paper we present a proposal for
a Grid architecture for a production testbed at the LHC experiments. It is not
the intent to define a general Grid system, but rather a system specific for batch
processing suitable for problems encountered in High Energy Physics. Interactive

¹ The term Nordic covers the countries: Denmark, Norway, Sweden and Finland.
² Globus Project and Globus Toolkit are trademarks held by the University of Chicago.

J. Fagerholm et al. (Eds.): PARA 2002, LNCS 2367, pp. 76-86, 2002.
© Springer-Verlag Berlin Heidelberg 2002
and is usually realized by an NFS server. A remote SE is usually a stand-alone
machine running e.g. a GridFTP [6] server with local file storage. Data replication
is done by services running on the SE.

A dedicated pluggable GridFTP server has been developed for use on the
SE. At the moment a simple file access plugin exists. The main reason for this is
to have a way to provide consistent certificate-based access to the data.
At least one other Grid solution to a certificate-based filesystem exists [7]. One
advantage of the GridFTP approach is that it is done entirely in user space and
thus is very portable.

2.3 Replica Catalog - RC

The information about replicated data is contained in the Replica Catalog (RC).
This is purely an add-on component to the system and as such is not a
requirement.

2.4 Information System — IS 

A stable, robust, scalable and reliable information system is the cornerstone of
any kind of Grid system. Without a properly working information system it is not
possible to construct a functional Grid. The Globus Project has laid down the
foundation of a Grid information system with their LDAP-based Metacomputing
Directory Service (MDS) [9]. The NorduGrid information system is built upon
the Globus MDS.

The information system described below forms an integral part of the
NorduGrid Testbed Architecture. In our Testbed, the NorduGrid MDS plays a
central role: all the information related tasks, like resource discovery, Grid
monitoring, authorized user information, and job status monitoring, are exclusively
implemented on top of the MDS. This has the advantage that all the Grid
information is provided through a uniform interface in an inherently scalable and
distributed way due to the Globus MDS. Moreover, it is sufficient to run a single
MDS service per resource in order to build the entire system. In the NorduGrid
Testbed a resource does not need to run dozens of different (often centralized)
services speaking different protocols: the NorduGrid Information System is purely
Globus MDS built using only the LDAP protocol.

The design of a Grid information system always deals with questions like
how to represent the Grid resources (or services), what kind of information
should be there, and what is the best structure for presenting this information to the
Grid users and to the Grid agents (i.e. Brokers). These questions have their
technical answers in the so-called LDAP schema files. The Globus Project provides
an information model together with the Globus MDS. We found their model
unsuitable for representing computing clusters, since the Globus schema is rather
single machine oriented. The EDG suggested a different CE model which we have
evaluated [8]. The EDG's CE-based schema fits better for computing clusters.
However, its practical usability was found to be questionable due to improper
implementation.
Fig. 1. The Norway branch of the NorduGrid MDS tree



2.5 Grid Manager - GM 

In our model, job management is handled by a single entity which we call the
Grid Manager (GM). It is the job of the GM to process user requests and prepare
them for execution on the CE. It also takes care of post-processing the jobs before
they leave the CE. In the Globus Toolkit context, the GM takes care of what
is normally done by the Globus jobmanager. In fact it is installed in a similar
way to a standard Globus jobmanager and can work perfectly together with
already existing jobmanagers. Authorization and authentication are still done by
the Globus gatekeeper.

The status of each job is recorded in a special status directory which also
contains control files needed by the GM.
Fig. 3. Example nordugrid-pbsjob entry



all job requests as well as data payload had to pass through, would be a single
point of failure and non-scalable.

The NorduGrid UI is at present command-line driven, while a web-based
solution is foreseen in the future. The UI is responsible for generating the user
request in a Resource Specification Language (RSL) based on the user input. The
RSL we use has additional attributes to those provided by the Globus Toolkit [12].
This xRSL has been enhanced to support enriched input/output capabilities and
more detailed specification of PBS requirements. All unneeded Globus attributes
have been deprecated.
Fig. 4. NorduGrid task flow



10. User Interface may cancel jobs by sending cancellation commands through
the Gatekeeper to the Grid Manager. The Grid Manager will then take care
of the job clean up
ABSTRACT

This paper describes staged simulation, a technique for
improving the run time performance and scale of discrete
event simulators. Typical wireless network simulations are
limited in speed and scale due to redundant computations,
both within a single simulation run and between successive
runs. Staged simulation proposes to reduce the amount
of redundant computation within a simulation by
restructuring discrete event simulators to operate in stages that
precompute, cache, and reuse partial results. This paper
presents a general and flexible framework for staging, and
identifies the advantages and trade-offs of its application
to wireless network simulations. Experience with applying
staged simulation to the ns2 simulator shows that it can
improve execution time by an order of magnitude in typical
scenarios and make feasible the simulation of large scale
wireless networks.

1 INTRODUCTION 

The design and evaluation of distributed systems and network
protocols relies to a large extent on network simulation.
Traditional network simulators, however, do not run efficiently
or scale well with increasing simulation size.

A significant source of inefficiency in discrete event
simulators is redundant computation. We identify two
different classes of redundancy in traditional discrete-event
simulators. The first class of redundant computation occurs
within a single run of the simulator. Traditional network
simulators reevaluate complex functions whenever their
results may have changed, even though in reality the results
may have changed very little, if at all, since the last time
they were evaluated. A second class of redundant
computation stems from a lack of retained information between
multiple runs of the simulator. Executing each simulation
independently and without the benefit of past runs leads to
computing many functions from scratch in each run. These
two sources of redundancy pose significant bottlenecks for
wireless network simulations, where network parameters
change frequently.

This paper introduces staged simulation, a general
technique to improve the scale and performance of wireless
network simulation by exposing, identifying, and
eliminating sources of redundant computation. Staging involves
restructuring the events in a discrete-event simulator into
an equivalent set of sub-computations, caching their results,
and reusing them whenever matches are identified. We
introduce three techniques, called function decomposition,
refinement, and batching, to complement function caching
and improve its effectiveness. We apply these techniques
both within a single simulation, a technique called
intra-simulation staging, and between multiple similar runs of
the simulator, called inter-simulation staging.

We have applied staging to the event processing engine
of ns2 (VINT 1995), a well-established simulator whose
design is typical of many discrete event simulators. Staging
improved execution time by an order of magnitude over the
standard ns2 implementation under typical simulation
scenarios. As a natural consequence of eliminating redundant
computation, staging in ns2 also reduced the running time
from O(n^2) in the size of the simulated wireless network to
O(n), making feasible large scale simulations with tens of
thousands of nodes. Staging maintains strict compatibility
with existing simulation scripts and extensions, with no
loss in simulator generality or accuracy. More advanced
and specialized simulation engines can benefit equally from
staging. Specifically, we expect to see a comparable speedup
and improvement in scalability in parallel and distributed
wireless network simulators.

The contributions of this paper are as follows. First,
we identify and expose a general technique for improving
discrete event simulator performance. Second, we show
how common simulation scenarios can benefit substantially
from our optimization techniques. These benefits include
drastically reduced simulator run time and good scalability
without changing the simulator interface or degrading result
accuracy. Finally, we validate our technique through
systematic application to wireless simulation in a well-established
network simulator.

2 THE STAGING APPROACH

The goal of staging is to eliminate redundant or nearly
redundant computations in simulations. Traditional wireless
simulators perform many redundant computations within
a single run. Examples of common redundancies include
sending packets along a particular path or computing
neighbor sets. Similarly, across multiple runs of a simulator we
find a large overlap in computation, especially when
numerous runs of a simulation are made with only slightly varying
parameters. For example, studies of proposed ad hoc
routing protocols typically call for several sets of simulation
runs, each set evaluating the effect of a single protocol or
topology parameter (see, for example, Broch et al. 1998 and
Royer and Toh 1999). In all, many dozens or hundreds of
runs might be executed with very similar input parameters.

The simplest, most fundamental technique for
eliminating redundant computations is function caching. This
space-for-time trade-off involves caching the results of
idempotent functions and later reusing those results whenever the
same function is invoked with the same inputs. While
function caching forms the foundation for staging, it, by itself, is
not sufficient to realize performance gains in practice.
Typical events in discrete event simulators have time-varying,
continuous inputs, which preclude matching function inputs
between calls.
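
The limitation can be seen in a small Python sketch that memoizes a stand-in for an expensive function. The path_loss function and its formula are hypothetical placeholders chosen for this example and are not taken from ns2.

```python
import math
from functools import lru_cache

calls = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def path_loss(distance: float) -> float:
    """Stand-in for an expensive idempotent function (hypothetical model)."""
    global calls
    calls += 1
    return 20.0 * math.log10(distance)


# Identical inputs are served from the cache ...
path_loss(100.0)
path_loss(100.0)
hits_after_repeat = path_loss.cache_info().hits

# ... but continuously varying inputs never repeat exactly, so every one
# of these calls misses the cache and recomputes the result.
for step in range(5):
    path_loss(100.0 + 0.001 * step + 0.0005)
```

After this run the function body has executed six times for seven calls: the only cache hit came from the exact repetition, which is the situation that rarely arises with time-varying inputs.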

Staging significantly improves on function caching by
introducing three techniques, called function decomposition,
refinement, and batching. These techniques restructure
computations such that their results are reusable even when a
change in inputs would normally preclude reuse.

Function decomposition splits a large computation into
several smaller sub-computations that are each
dependent on only a subset of the inputs to the original
computation. By carefully choosing the decomposition, we
can reduce or eliminate the dependency on frequently varying
inputs. For example, replacing a function f(x, y, r) with
an equivalent, decomposed version f'(S(x, y), r) can allow
S(x, y) to be cached and reused even when the parameter
r varies between calls.
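
A minimal Python sketch of this decomposition follows. The concrete
choice of S as a Euclidean norm and of f' as a threshold test is an
assumption made only for illustration; the technique applies to any
computation that factors this way.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def s(x: float, y: float) -> float:
    """Sub-computation S(x, y): depends only on the slowly varying inputs."""
    return math.hypot(x, y)


def f_prime(s_xy: float, r: float) -> bool:
    """Cheap final step f'(S, r): combines the cached part with r."""
    return s_xy <= r


def f(x: float, y: float, r: float) -> bool:
    """Decomposed equivalent of the original f(x, y, r)."""
    return f_prime(s(x, y), r)


# r changes on every call, yet S(3, 4) is computed only once.
results = [f(3.0, 4.0, r) for r in (4.0, 5.0, 6.0)]
info = s.cache_info()
```

Caching f directly would miss on all three calls, because the full
argument tuple differs each time; caching S alone misses once.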

Refinement further expands the applicability of function
caching by taking advantage of the continuity of the physical
model underlying the computation. When a small change
in inputs is expected to lead to little or no change in the
computed results, computing bounds then refining them
to precise results can be more efficient than computing the
same result from scratch. For instance, computing upper and
lower bounds on node mobility may allow the simulator to
eliminate costly computations to determine neighborhoods.
In this case, the upper and lower bounds are computed such
that they are valid for a range of inputs and so can be cached
and reused even when inputs vary slightly between calls.

The third staging technique, batching, reorders the
computations within the simulator so that many independent,
fine-grained computations can be executed more efficiently
in a single pass. Function decomposition and refinement
both transform the event stream in a simulator into an
equivalent, but much finer grained, sequence of computations.
Many of these computations are not time dependent,
and so can be reordered without affecting simulation
accuracy. Batching groups related computations together, and
replaces them with a single computation which computes
all the needed results efficiently in a single pass. Batching
not only allows the utilization of more efficient global
algorithms instead of independent local computations, but can
also improve processor and memory cache performance
by improving locality.

Staging fundamentally involves a space-time trade-off.
For staging to be worthwhile, the target computation must
be more expensive than the cost of storing and fetching
cached results from a potentially large table. Additionally,
storing the cached results will likely increase the amount
of memory required for the simulation. Although this
increase in memory use may increase virtual memory paging
by increasing the working set, it may conversely reduce the
working set by eliminating memory intensive computations.

The remainder of this paper illustrates the use of staging
in a widely used network simulator under typical
scenarios. We give examples of existing, ad hoc applications
of staging in current state of the art simulators, identify new
opportunities for staging, and evaluate the effectiveness of
both intra- and inter-simulation staging in a ubiquitous and
mature network simulation engine.

3 TRADITIONAL WIRELESS SIMULATION 

Efficient and scalable wireless network simulators are critical
to network research, but present unique challenges in their
implementation. They differ from other simulators in several
key ways, each of which introduces redundant computation
at runtime. As a result, many commonly used wireless
simulators are slow and do not scale gracefully with network
size.

The fundamental reason redundant computation is
prevalent is that wireless mobile networks have highly
dynamic characteristics, which imply that simulation state must
be recomputed dynamically and often. As nodes move about
a simulated field, the network-level topology may change
rapidly. Link characteristics, routing information, and
network topologies must be maintained and recomputed during
the simulation, and mobile nodes must continually update
their positions in order to provide accurate information to
the network model. In addition, complex physical models
make wireless simulation expensive. Since wireless is a
broadcast medium, a straightforward simulation approach
treats the network as a single broadcast LAN, incurring
O(n^2) run time in a network with n active nodes.

Existing wireless network simulators address some of
the challenges of wireless networks. These range from
general-purpose simulators, such as ns2 and OpNet (Chang
1999), to special-purpose and custom simulators including
SWiMNet (Boukerche et al. 1999), MobSim++ (Liljenstam,
Rönngren, and Ayani 2001), DaSSF (Liu et al. 2001),
and GloMoSim (Zeng, Bagrodia, and Gerla 1998). These
simulators have widely varying designs, including parallel or
distributed event engines and specialized language features.
Distributed simulators achieve scalability and performance
by recruiting multiple simulator hosts. Even in such systems,
each simulator host may perform a large amount of redundant
computation that can be eliminated to improve efficiency.

We chose to study wireless simulation in the ns2 network
simulator because it is widely used in academic research,
and because it has a well-established and validated set of
protocols. The protocol implementations in ns2 total over
150,000 lines of code, and provide accurate models for node
mobility, wireless energy consumption, radio propagation,
and MAC-layer protocols.

Ns2 is known to be slow and to scale poorly with an increasing
number of nodes. As we show in the following sections,
staged simulation can drastically reduce the amount of work
required to simulate a wireless system by reducing redundant
computation. These results are not specific to ns2, but can
be applied likewise to more advanced simulation engines
as well.

4 STAGED SIMULATION IN NS2 

In the baseline ns2 implementation, the wireless physical
layer and mobility models are the largest consumers of
processing time in typical simulation scenarios. These
components pose the most significant bottlenecks to efficiency
and scaling. Consequently, we focus on staging computations
related to node mobility and the wireless physical
layer.

We incrementally describe four different types of staging,
each employing a different approach to eliminating
redundant computation. The first is an example of reusing
common intermediate results across function calls. The
second demonstrates the use of restructuring to enlarge the
overlap in computation across calls. The third optimization
illustrates precomputation as a staging technique, and the
final one demonstrates inter-simulation staging by reusing
results across multiple runs of the simulator.



4.1 Grid-Based Neighborhood Computation

For staging to be effective, redundant computations need to
be readily identifiable. The monolithic structure of the
default ns2 implementation, however, obscures the redundant
computations it performs at runtime. Specifically, ns2 in
particular, and wireless network simulators in general,
perform numerous calculations to ultimately determine the set
of nodes that will receive a given packet. These calculations
depend on the positions of sending and receiving nodes,
packet transmission and detection power levels, geography,
and radio and antenna models. We note that many of these
inputs will be identical or similar across computations, and
show in Section 5 that the resulting redundant operations
are significant and lead to non-linear scaling with network
size.

To expose parts of this redundancy, we first apply a
very simple grid-based staging approach where we reuse
previously computed power levels for nearby nodes. We
first divide the coordinate space into a grid of buckets, with
each bucket holding a list of nodes positioned within the
corresponding grid rectangle. This data structure can then
be used to quickly determine if a group of nodes falls entirely
outside the possible transmission range of a node, thereby
eliminating the need to perform individual calculations for
each node. Nodes in the remaining buckets, which may or
may not be in range, are checked individually as before.
In order to maintain the grid as nodes move during the
simulation, we compute all of the times at which a node
will cross a grid boundary, scheduling events at these times
to update the grid as needed.
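
The bucket structure can be sketched in a few lines of Python. The
class and method names below (Grid, place, neighbors) are hypothetical
and do not correspond to ns2 code. As a simplification, the sketch
re-buckets a node whenever place is called instead of scheduling
grid-crossing events in advance.

```python
import math
from collections import defaultdict


class Grid:
    """Uniform grid of buckets, each holding the ids of the nodes inside it."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.buckets = defaultdict(set)  # (ix, iy) -> set of node ids
        self.pos = {}                    # node id -> (x, y)

    def _cell(self, x: float, y: float):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def place(self, node, x: float, y: float) -> None:
        """Insert a node, or move it to a new position."""
        if node in self.pos:
            self.buckets[self._cell(*self.pos[node])].discard(node)
        self.pos[node] = (x, y)
        self.buckets[self._cell(x, y)].add(node)

    def neighbors(self, node, radius: float) -> set:
        """Nodes within `radius`; buckets outside the bounding square are skipped."""
        x, y = self.pos[node]
        lo = self._cell(x - radius, y - radius)
        hi = self._cell(x + radius, y + radius)
        found = set()
        for ix in range(lo[0], hi[0] + 1):
            for iy in range(lo[1], hi[1] + 1):
                for other in self.buckets.get((ix, iy), ()):
                    if other == node:
                        continue
                    ox, oy = self.pos[other]
                    if math.hypot(ox - x, oy - y) <= radius:
                        found.add(other)
        return found


grid = Grid(cell_size=250.0)
for name, (px, py) in {"a": (0, 0), "b": (100, 100),
                       "c": (400, 0), "d": (2000, 2000)}.items():
    grid.place(name, px, py)
near = grid.neighbors("a", 250.0)
far = grid.neighbors("a", 531.0)
```

Node d lies in a bucket that is never visited for either query, which
is the saving the grid provides over a scan of all nodes.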

While grid-based decomposition in simulators is not
novel, it serves as an initial application of staging that
enables us to identify and eliminate other redundant
computations through more advanced applications of staging in
the subsequent sections. Nevertheless, grid-based
neighborhood computation employs staging in two distinct ways.
First, by grouping nodes into buckets, the simulator can
reuse a single computed result for all nodes within the
bucket. Furthermore, since the grid data structure will
remain fixed across many packet transmissions, we can share
and reuse a single global grid structure. We assume here,
as is typical in ad hoc network research, that all
nodes use uniform and constant transmission and reception
parameters. This assumption does not present a limitation of
the staged simulation approach, but simplifies our examples
considerably.

4.2 Neighborhood Caching

Variations on the grid approach allow more advanced
applications of staging using auxiliary computations to reduce
redundancy in computation across packet transmissions. In
typical simulation scenarios, inter-packet spacing is very
short in comparison to the speed at which nodes move.
Depending on node mobility and traffic patterns, many
hundreds or thousands of packets may be transmitted from
a single node before nodes move a significant distance. That
is, we should expect the inputs to, and hence the results
of, the neighborhood computation for a node to be reusable
across many packet transmissions.

Since inputs will vary slightly, we should not expect
the neighborhood set to be identical to that computed during
the previous packet transmission. However, a conservative
upper-bound, or superset, of the neighborhood set will
remain valid for some time after it is computed, depending
on the amount of node mobility and the tightness of the
bound. This holds similarly for a lower-bound or subset of
the neighborhood set. We therefore restructure the
neighborhood set computation to first compute upper and lower
bounds on the result, then refine these bounds into an exact
result. After restructuring the computation, intra-simulation
staging is used to cache and reuse the common intermediate
results, the two bounds, across many packet transmissions.

This restructuring introduces one additional parameter,
Δt, to control the caching policy. This parameter fixes
the desired epoch duration for which the bounds on the
neighborhood set will be valid. If s_max is the maximum
possible node speed in the movement scenario, then the
maximum change in distance between two nodes in an
epoch is just Δr = 2 s_max Δt. If two nodes are within
distance r - Δr at some time, then they will remain within
range r for Δt seconds into the future. Similarly, nodes
beyond distance r + Δr need not be considered at all for
Δt seconds into the future.
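
As a worked example of these bounds, the Python sketch below computes
the maximum change in distance and sorts a node into one of three
classes by its current distance. The function names and the class
labels "inner", "annulus", and "outside" are ours, chosen for
illustration.

```python
def delta_r(s_max: float, delta_t: float) -> float:
    """Largest possible change in distance between two nodes in one epoch."""
    return 2.0 * s_max * delta_t


def classify(distance: float, r: float, s_max: float, delta_t: float) -> str:
    """Classify a node by its current distance from the sender."""
    dr = delta_r(s_max, delta_t)
    if distance <= r - dr:
        return "inner"     # guaranteed in range for the whole epoch
    if distance <= r + dr:
        return "annulus"   # must be rechecked on each transmission
    return "outside"       # cannot come into range during the epoch


# Maximum speed 5 m/s, a 2 second epoch, and a 250 m radius.
dr = delta_r(5.0, 2.0)
labels = [classify(d, 250.0, 5.0, 2.0) for d in (200.0, 240.0, 300.0)]
```

With these values the change in distance is at most 20 m, so only
nodes between 230 m and 270 m from the sender need an exact range
check during the epoch.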

We maintain a cache to capture the upper and lower
bounds on the neighborhood set of each node. At most one
cache entry is maintained for each node in the network.
A cache entry, illustrated in Figure 1, is composed of an
expiration time and two sets, N_{r-Δr} and N_{r±Δr}, containing
lists of the nodes within a ball of radius r - Δr and those in
the annulus with radii r ± Δr. During packet transmission,
the cache manager computes the set of nodes within range
of a given node by first looking for a valid cache entry.
Finding an entry that has not yet expired, it can immediately
consider all nodes in the list N_{r-Δr} to be within range.
The second list N_{r±Δr} is then scanned, and each node
found to be within range is appended to the final result.
At the same time, it can cheaply but conservatively update
the lists, moving some nodes from N_{r±Δr} to N_{r-Δr} and
eliminating others from N_{r±Δr} entirely. If, on the other
hand, no cache entry is found during packet transmission,
the cache manager consults the underlying mobility (grid)
manager and constructs a cache entry with expiration Δt
seconds into the future.
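
The lookup procedure can be sketched in Python as follows. The names
CacheEntry and NeighborCache are hypothetical, and the sketch makes two
simplifying assumptions: a scan over all positions stands in for the
grid manager on a miss, and the conservative list updates performed
during a hit are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheEntry:
    expires: float   # simulation time at which the bounds stop being valid
    inner: set       # nodes within r - delta_r: in range until `expires`
    annulus: set     # nodes within r +/- delta_r: must be rechecked


def dist(a, b) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


class NeighborCache:
    def __init__(self, r: float, s_max: float, delta_t: float):
        self.r = r
        self.delta_t = delta_t
        self.delta_r = 2.0 * s_max * delta_t
        self.entries = {}    # at most one entry per node
        self.misses = 0

    def _build(self, node, now, positions) -> CacheEntry:
        """Cache miss: examine the larger radius r + delta_r once."""
        self.misses += 1
        inner, annulus = set(), set()
        for other, p in positions.items():   # stand-in for a grid query
            if other == node:
                continue
            d = dist(positions[node], p)
            if d <= self.r - self.delta_r:
                inner.add(other)
            elif d <= self.r + self.delta_r:
                annulus.add(other)
        entry = CacheEntry(now + self.delta_t, inner, annulus)
        self.entries[node] = entry
        return entry

    def in_range(self, node, now, positions) -> set:
        entry = self.entries.get(node)
        if entry is None or now >= entry.expires:
            entry = self._build(node, now, positions)
        result = set(entry.inner)            # accepted without any check
        for other in entry.annulus:          # refined with an exact check
            if dist(positions[node], positions[other]) <= self.r:
                result.add(other)
        return result


positions = {"a": (0.0, 0.0), "b": (100.0, 0.0), "c": (248.0, 0.0),
             "d": (252.0, 0.0), "e": (600.0, 0.0)}
cache = NeighborCache(r=250.0, s_max=5.0, delta_t=2.0)
first = cache.in_range("a", now=0.0, positions=positions)
positions["c"] = (253.0, 0.0)   # c drifts out of range
positions["d"] = (248.0, 0.0)   # d drifts into range
second = cache.in_range("a", now=1.0, positions=positions)
```

The second query is a cache hit: only the two annulus nodes are
rechecked, yet the result still reflects their new positions.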

In the above caching scheme, there is some additional
overhead during cache misses, when computing N_{r±Δr},
since a larger radius is considered than previously necessary.




Figure 1: Computing Bounds on Node Movement Enables
the Simulator to Examine Only the Nodes Located in an
Annulus N_{r±Δr} During Packet Transmission by Node at
Center

This overhead is controlled directly with the parameter Δt,
which fixes the longevity and the accuracy of cache entries.
In addition, there is overhead associated with scanning the
list of nodes in N_{r±Δr} during each cache hit, but this is
also limited by appropriately choosing the Δt parameter.
We analyze these overheads in Section 5.

4.3 Perfect Caching

There is a large overlap in computation when constructing
cache entries for nodes using the neighborhood caching
scheme. We use precomputation to address this redundancy
by computing many cache entries simultaneously. When
constructing a cache entry, a node normally examines all
nodes within a potentially large radius. If many nodes in a
reasonably dense network are active, and each periodically
constructs cache entries on-demand and independently, each
pair of nodes will eventually be considered twice.

A staged simulation approach, which we term perfect
caching, eliminates redundancy by precomputing all
cache entries simultaneously. This approach maintains the
same data structures as neighborhood caching. But rather
than calculating cache entries on-demand, it precomputes all
cache entries at the beginning of every Δt epoch. All normal
queries for neighborhood information are then guaranteed
to hit the cache. There are several possible advantages to
precomputation. First, we only need to examine each pair
of nodes at most once, rather than twice, to compute all
of the entries. Second, the positions of all nodes can be
updated a single time at the start of the generation process.
Previously, it was necessary to update the positions of all
nodes within range of the sender during each cache miss.
Finally, memory locality should improve when precomputing
all entries simultaneously as compared to individually
on-demand.
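
The first advantage can be illustrated with a short Python sketch in
which each unordered pair of nodes is examined once and the result is
recorded in both nodes' entries. The function name precompute_entries
is hypothetical, and the loop over all pairs is a simplification: a
real implementation would presumably restrict the pairs using the grid.

```python
import math
from itertools import combinations


def precompute_entries(positions: dict, r: float, delta_r: float):
    """Build every node's inner and annulus sets in a single pass."""
    entries = {n: {"inner": set(), "annulus": set()} for n in positions}
    pairs_examined = 0
    for a, b in combinations(positions, 2):
        pairs_examined += 1
        (ax, ay), (bx, by) = positions[a], positions[b]
        d = math.hypot(ax - bx, ay - by)
        if d <= r - delta_r:
            key = "inner"
        elif d <= r + delta_r:
            key = "annulus"
        else:
            continue
        # One distance computation serves both endpoints of the pair.
        entries[a][key].add(b)
        entries[b][key].add(a)
    return entries, pairs_examined


positions = {"a": (0.0, 0.0), "b": (100.0, 0.0),
             "c": (260.0, 0.0), "d": (900.0, 0.0)}
entries, pairs_examined = precompute_entries(positions, r=250.0, delta_r=20.0)
```

For four nodes the sketch examines six pairs, where four independent
on-demand constructions would examine twelve ordered pairs.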

The overhead of this technique is a scheduled event
during each Δt epoch and possibly some wasted computation
if some nodes do not send packets during an epoch,
and thus do not use their cache entries. In a sparse or
quiet network, perfect caching might construct more entries
than needed during the simulation. This problem can be
addressed directly by appropriately choosing the Δt epoch
parameter.

4.4 On-Disk Caching

A final inter-simulation staging application improves on
perfect caching, and demonstrates how staging can be applied
across multiple similar runs of the simulator. The
intra-simulation examples above reduce the amount of
computation significantly, but also add some additional events to
the event queue, leading to more work in the event
scheduler. Event queue management is a well-studied problem,
especially in the particular case of the Calendar Queue
used in ns2. However, we can eliminate the work done by
many events by looking at a set of simulation runs together.
This application of inter-simulation staging therefore builds
on the previous optimizations by reducing the number of
scheduled events generated by the grid manager and the cost
of constructing neighborhood cache entries in the perfect
caching scheme.

First note that, by itself, perfect caching generates
strictly more events than the on-demand caching approach,
and may actually compute results that are not used in any
particular simulation run. But also observe that perfect
caching will perform identical work during multiple
simulation runs using the same mobility scenario. In the second
and subsequent runs of the simulator we can eliminate these
extra events, as well as most cache maintenance, by writing
all cache entries to disk every Δt seconds during the first
simulator run. Subsequent runs can obtain cache entries
from disk rather than maintaining an underlying cache
manager or grid. This technique then introduces two phases.
The generation-phase is identical to perfect caching except
that all cache entries are spooled to disk. The use-phase
does not maintain a grid, does not need to track changes to
node positions, and requires no scheduler events. Instead,
cache entries are read from disk serially as needed during
packet transmission. A set of runs with the same mobility
model will use the more expensive generation-phase for the
first run, and the less expensive use-phase for all remaining
runs.
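
The two phases can be sketched in Python as follows. The on-disk
format (one JSON record per epoch), the file name, and the function
names generation_phase and use_phase are assumptions made for this
illustration; the paper does not specify the format used in ns2.

```python
import json
import os
import tempfile


def generation_phase(path: str, epochs) -> None:
    """First run: spool the cache entries of every epoch to disk, in order."""
    with open(path, "w") as f:
        for start, entries in epochs:
            f.write(json.dumps({"start": start, "entries": entries}) + "\n")


def use_phase(path: str):
    """Later runs: read the entries back serially, one epoch at a time."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield record["start"], record["entries"]


# Entries as they might be produced by perfect caching for two epochs.
epochs = [
    (0.0, {"a": {"inner": ["b"], "annulus": ["c"]}}),
    (2.0, {"a": {"inner": ["b"], "annulus": []}}),
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "neighbors.jsonl")
    generation_phase(path, epochs)
    replayed = list(use_phase(path))
```

Because use_phase is a generator, a later run holds only the current
epoch's entries in memory and needs neither a grid nor position
tracking.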

5 EVALUATION 

We have implemented each of the optimizations detailed in
Section 4 in the ns2 simulator. We find that even the simplest
application of staging reduces the run time of the simulator
significantly, and allows for practical simulation of much
larger network sizes than previously feasible. We show that
more advanced intra-simulation techniques improve stability
and robustness of the simulator, while the application of
inter-simulation staging improves performance yet further.
With the latest staged implementation, we regularly simulate
networks of over 1000 nodes in the time it previously took
to simulate networks of hundreds of nodes.

In addition to evaluating total simulation run time using
our techniques, we also characterize the effect of each
parameter we have introduced. For staging to be effective,
it must be possible to easily or automatically find
near-optimal choices for these parameters and, at the very least,
avoid parameter choices that would lead to run time behavior
worse than the default, non-staged implementation. We first
describe our test environment and changes required to add
staging to the simulator, then present the results of our
staging techniques.

5.1 Evaluation Platform and Environment

We take as our baseline a modified ns2 version 2.1b9a
simulator. All simulations were completed on a single-processor
machine equipped with a 1.7GHz Pentium 4 processor and
256MB of physical memory. Physical memory is an
important constraint in ns2: more generous machines can simulate
proportionally larger networks before becoming
memory-limited. Before implementing our staging techniques, we
made a few non-standard modifications to improve the
baseline ns2 code. Most notably, we disabled all unused packet
headers to reduce packet sizes and improve memory
locality, and implemented more efficient packet tracing. This
improved run time by 85% for a 250 node network. The
performance results detailed in this paper are computed
relative to this optimized ns2 baseline implementation.

Staging can impact the performance of a simulator
by introducing fine-grain events and changing the event
distribution observed by the event scheduler. Calendar
queue schedulers are particularly sensitive to such
perturbations (Oh and Ahn 1999). To counteract the sensitivity of
the calendar queue scheduler to the event distribution, we
modified the calendar queue event scheduling algorithm to
re-optimize the event queue after 30 seconds of simulated
time, effectively avoiding occasional mis-predictions by the
scheduler.

Overall, our simulation runs closely resemble those
discussed in Broch et al. (1998), a very common setup.
We used standard CMU Monarch mobility and
communication model generators from the standard ns2 distribution.
As an exemplar of typical wireless network research, we
chose the AODV ad hoc routing protocol implementation
included with ns2. Our results are not specific to these
choices of application, mobility model, or communication
pattern. These system parameters, summarized in Table 1,
closely follow the standard values used in ad hoc networking
literature. Although the nominal reception radius for our
antenna model is only 250 meters, we use the transmission
detection radius of 531 meters for all optimizations in order
to properly account for interference effects.
Table 1: Default Simulation Parameters for Experiments

Network load
  model                      Constant bit rate
  concurrent data streams    30
  packet size & rate         512 bytes x 8 packets/s

Node mobility
  model                      random-waypoint
  maximum node speed         5 m/s
  pause time                 10 s
  field density              nodes

Simulation
  routing protocol           AODV
  simulation time            400 s



5.2 Simulator Performance

We first examine how the different applications of staging
affect total simulation execution time using a 1000 node
network. In this experiment, we fix grid granularity at
250 meters and Δt at 2 seconds, and later describe their
selection and the sensitivity of staging to these parameters.
We run our simulations with various applications of staging
enabled, as shown in Table 2. For each level of staging,
we run the simulator on five randomly generated networks
and present the average of the execution times. The sample
standard deviation for each data point is less than 0.2%.

Table 2: Levels of Ns2 Optimization for Experiments

Level   Optimizations

L0      Ns2 baseline: improved tracing and packet size

Intra-simulation staging
L1      L0 + Grid-based
L2      L1 + Caching
L3      L2 + Perfect caching

Inter-simulation staging
L4a     L3 + On-disk caching (generation)
L4b     L3 + On-disk caching (use)



The speedup achieved by increasing levels of staging
relative to the baseline simulator is shown in Figure 2.
These results, obtained using a 1000 node network,
highlight especially the benefits of the simplest intra-simulation
staging technique, L1, and of the inter-simulation staging
technique. Optimization level L4b, the second phase of the
inter-simulation staging approach, improves simulation run time
significantly in comparison to using only intra-simulation
techniques. Also, the one-time cost of the first phase, L4a, is
no worse than the best possible intra-simulation technique,
L3. Thus, in this case inter-simulation staging imposes no
additional cost during the first run of a series, but offers a
significant speedup during subsequent runs.




Figure 2: Speedup in Execution Time with Increasing
Staging Relative to Baseline Ns2 Implementation using a 1000
Node Network

5.3 Scaling with Network Size

In order to evaluate how staging affects simulation scale,
we simulated networks with varying number of nodes while
holding the application-level load constant and increasing
the field size to maintain a constant node density.

Figure 3 shows that staging can improve the scalability
of wireless simulators by reducing redundant
computations. This experiment also demonstrates the benefits of
inter-simulation staging, which achieves a 56% improvement
over the intra-simulation staging techniques in 1000 node
networks. Although the different intra-simulation staging
approaches show similar performance in this experiment,
they exhibit different behaviors as optimization parameters
or network characteristics change. As we show in the next
two sections, the more advanced optimizations offer
increased robustness and stability, an advantage not evident
in Figure 3.

Additional experiments indicate similar performance
benefits using networks of varying density, up to more than
twice the density used above. Very dense networks,
however, expose a trade-off in our disk-based inter-simulation
optimization. In our implementation, cache entries are
stored on disk during the first simulator run, and must be
read from disk and processed during each subsequent run.
While most of these disk accesses are easily pipelined and
dispatched in the background, there is still a non-negligible
CPU cost for dispatching and processing data stored on
disk. As network density increases, the cache entries grow
larger and cache processing may become more expensive
than simply recomputing results from in-memory data.

This trade-off is present to some extent in any result
caching scheme, and designers must be careful that cache
overhead is less than the cost of recomputation. But in
practice we find that only the disk-based caching optimization
might impose a significant processing overhead, for certain
networks, and that the optimization offers a net improvement
in performance for networks of reasonable density.

Figure 3: Effect of Network Size on Total Simulation Run
Time Holding Node Density Constant

5.4 Optimization Parameters

It is important to characterize the effect of any new
simulation parameter introduced by our optimization techniques.
We study simulation performance under various choices for
optimization parameters and examine the robustness and
stability of the different optimization levels. Recall that the
grid-based intra-simulation approach introduces a
granularity parameter, and the caching intra-simulation approach a
lookahead parameter.

We first evaluate the effect of varying grid granularity
on each level of staging. Intuitively, it is clear that a very
fine granularity will give rise to many grid-crossing events
as nodes move about in the topology, and also leads to more
work in packet transmission, as many empty bins will be
scanned for nodes. Conversely, a very coarse granularity
reduces to a single bucket and, essentially, a scan over all
nodes during each packet transmission or cache miss. A
reasonable choice is to use the node transmission radius,
which requires a scan of roughly nine buckets during each
transmission or cache miss.
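
The bucket counts behind this intuition can be checked with a short
Python calculation. The helper buckets_scanned is hypothetical; it
counts the grid cells that overlap the bounding square of the
transmission range, which is what a bucket scan would visit.

```python
import math


def buckets_scanned(x: float, y: float, radius: float, cell_size: float) -> int:
    """Number of grid cells overlapping the square [x-r, x+r] x [y-r, y+r]."""
    nx = (math.floor((x + radius) / cell_size)
          - math.floor((x - radius) / cell_size) + 1)
    ny = (math.floor((y + radius) / cell_size)
          - math.floor((y - radius) / cell_size) + 1)
    return nx * ny


# A sender at (300, 300) with a 250 m radius, under three granularities.
at_radius = buckets_scanned(300.0, 300.0, 250.0, 250.0)    # cell = radius
very_fine = buckets_scanned(300.0, 300.0, 250.0, 1.0)      # 1 m cells
very_coarse = buckets_scanned(300.0, 300.0, 250.0, 1.0e6)  # one huge cell
```

With the cell size equal to the radius the scan touches a 3 x 3 block
of buckets; with 1 meter cells it touches 501 x 501 mostly empty
buckets, and with a single huge cell it degenerates to one bucket
holding every node.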

We run the simulator on a single 250 node network
with the same configuration as before and Δt fixed at 2
seconds, but vary the grid granularity. Figure 4 verifies
our intuitive description of the effects of grid granularity.
Interestingly, we find that any choice of granularity other
than the two extremes yields a substantial improvement
in run time under L1 staging, with only minor variation
between 500 and 2000 meters, and with the optimum choice
approximately 1500 meters.

In this experiment, even the right-most extreme
performs much better than the ns2 baseline implementation
since we avoid creating events and copies of the packet for
nodes outside the transmission range. Further, much of the
degradation due to a poor choice in granularity is mitigated
by the use of the higher levels of staging. In these cases,
the poorly-tuned grid is consulted only in the rare case of
a cache miss.

Figure 4: Effect of Varying Grid Granularity on Simulation
Run Time

The choice of grid granularity depends on the particular
choice of node mobility and load patterns. In practice, we
find that the optimal choice of granularity can be as low as
250 meters, but is rarely higher than 2000 meters. In all cases
we have examined, the trends are similar to those presented
above, making automatic tuning a feasible approach.

5.5 Caching Lookahead Parameter

The overhead of constructing cache entries is controlled
by the Δt parameter to the neighborhood caching routine.
Recall that Δt specifies the desired expiration time when
constructing a cache entry. A larger value means that a larger
radius must be examined to build a cache entry, leading
to a larger data structure, but allowing the cache entry to
remain valid for longer. We set up our simulator as in the
previous experiment, but fix the grid granularity at 250 m.

Figure 5 shows how Δt controls the cache hit rate
(top), and the sizes of the two neighborhood sets N_{r-Δr}
and N_{r±Δr} stored in cache entries (bottom). We only show
the results for L2 caching; those for perfect caching and
the first phase L4a of inter-simulation staging are identical.
For reference, the actual average neighbor set size for queries
is shown as constant N_r.

The overheads associated with caching are limited by
the cache hit rate and N_{r±Δr}. A very small value for Δt
leads to many cache misses, each of which is potentially
expensive. Conversely, a large value for Δt forces both
cache hits and misses to process a larger set N_{r±Δr}. The
cache is effective for reasonable values of Δt, roughly 2
to 4 seconds, with high hit rate but still reasonably sized
N_{r±Δr}. The curves for the neighborhood set sizes can be
explained geometrically based on the known transmission
radius, and the average number of neighbors of transmitting
nodes. The cache hit rate is a function of the average
inter-packet spacing. While our implementation does not pick Δt
automatically, the figure shows that a near-optimal value for
parameter Δt can be computed as a function of the packet
rate, node density, and transmission radius.

Figure 5: Effect of Varying Caching Parameter Δt on Cache
Hit Rate and Neighborhood Sizes

Surprisingly, even with such varying cache behavior
there is very little overall change in total simulation run
time. Further experiments indicate that run time varies by
at most 5% over the entire range of Δt values shown in
Figure 5. The L2, L3, and L4a staging
levels all perform similarly, while the second phase L4b
inter-simulation approach improves run time by approximately
30% as compared to L3, independent of the Δt parameter.
As with the grid granularity, nearly any reasonable choice of
parameter will work well for the highest levels of staging.

6 RELATED WORK 

Several important examples of staging can be found in
existing simulators. In our analysis of the ns2
implementation, we identified applications of staging, but find that
the technique of staging is not widely applied in the
implementation or recognized in the literature. There has been
no prior recognition or development of the technique of
staging as a general approach to simulation optimization.

The Nix Vector (Riley, Ammar, and Fujimoto 2000)
approach improves wired-network routing efficiency in the
ns2 simulator by computing and caching routes on demand
rather than maintaining a complete routing table. This
approach has not been applied between multiple runs of
the simulator, nor does it eliminate redundancy when inputs
vary slightly between computations.

A second example from ns2 is a grid implementation
very similar to our L1 staging. A key difference is that we
expose and explore the parameter space of grid granularities,
while the previous attempt uses a hard-coded granularity of 1
meter. In typical scenarios, this choice leads to performance
worse than the baseline. Similarly, Wu and Bonnet (2002)
propose an alternative packet transmission routine for ns2,
essentially equivalent to our L1 staging with granularity
parameter ∞. Again, we have shown that this choice of
granularity is particularly inefficient as compared to nearly
any other choice. These examples illustrate the importance
of properly characterizing staging parameters and relating
them to system variables such as the transmission radius
and expected number of neighbors.

In the context of discrete event simulators, we find
occasional use of staging or similar techniques to improve
performance. Splitting (Glasserman, Heidelberger, Shahabuddin,
and Zajic 1996), cloning (Hybinette and Fujimoto 1997),
and updateable simulations (Ferenci et al. 2002) are three
related techniques which eliminate identical computations
in multiple runs of the simulator. These techniques do
not exploit redundant computations within a single run of
the simulator, nor do they address computations which are
similar but not identical.

Boukerche et al. (1999) propose a two-phase design for
Personal Communications System (PCS) network simulation
using SWiMNet. This design is used to facilitate various
lookahead optimizations in a parallel simulation engine,
rather than to eliminate redundant computation or optimize
multiple runs of the simulator.

A popular technique for improving scale and performance
uses distributed simulation (for example Boukerche et al. 1999,
Liu et al. 2001, and Liljenstam, Rönngren, and Ayani 2001),
sometimes combined with specialized language features (for
example Zeng, Bagrodia, and Gerla 1998). These approaches are
complementary to our optimizations, since staged simulation can
be applied equally well to both distributed and centralized
designs. Other techniques are used to reduce simulation run
time, such as model abstraction and approximation (Huang,
Estrin, and Heidemann 1998, Gadde, Chase, and Vahdat 2002). Our
approach differs from model abstraction in that we do not alter
in any way the final result of computations. Additionally,
abstraction may not be possible if the system of interest has
not yet developed stable or well-understood models.

Finally, we note that staging as a concept is a general
technique, employed most notably in compilers and iterative
programming. Chambers (2002) discusses a staged compilation
technique that combines partial precompiling of code with
dynamic optimizations at runtime. Iterative programming is a
general framework for describing computation. Like staged
simulation, it relies on reusing results, intermediate values,
and extraneous values from previous iterations. Liu, Stoller,
and Teitelbaum (1996) discuss methods for automatically
extracting this information using program and data-flow
analysis. We find this particular approach unsuitable for large
and complex simulator implementations, where data-flow and
simulation behavior depend very heavily on the particulars of a
simulation run. Additionally, the use of multiple languages
compounds the difficulty of low-level automatic program
analysis.

7 CONCLUSIONS 

We propose a general technique, termed staged simulation,
for reducing the run time of discrete event simulators. The
central idea is to eliminate redundant or partially-redundant
computations typical in simulations by caching and reusing
the results of computations. The technique consists of
identifying redundant computation both within single runs
as well as across consecutive runs of the simulator. Staging
then relies on precomputing, caching and reusing partial
results to eliminate redundant computation. Our technique is
general and applicable to a wide range of designs, including
parallel and distributed simulation engines.

We show that staging is an effective technique for
reducing simulation run time without loss of accuracy, and
is effective in a wide range of simulation scenarios including
varying mobility and communication patterns, network sizes,
and node densities. We implement three levels of
intra-simulation staging and one level of inter-simulation
staging in the ns2 wireless networking simulation system. Simple
intra-simulation optimizations are found to reduce simulator
run time by a factor of 9 and to improve simulator scalability
from networks of hundreds of nodes to networks of ten
thousand nodes. An application of inter-simulation staging
can reduce run time even further to a factor of 21 over the
non-staged implementation. We find that the techniques are
robust in the choice of parameters, and these parameters
appear easy to estimate automatically as a function of other
simulation variables and observed runtime behavior.

Abstract 

The combination of Grid technology and web services has
produced an attractive platform for deploying distributed
applications: Grid services, as represented by the Open Grid
Services Infrastructure (OGSI) and its Globus toolkit
implementation. As the use of Grid services grows in
popularity, tolerating failures becomes increasingly
important. This paper addresses the problem of building a
reliable and highly-available Grid service by replicating the
service on two or more hosts using the primary-backup
approach. The primary goal is to evaluate the ease and
efficiency with which this can be done, by first designing a
primary-backup protocol using OGSI, and then implementing it
using Globus to evaluate performance implications and
tradeoffs. We compared three implementations: one that makes
heavy use of the notification interface defined in OGSI, one
that uses standard Grid service requests instead of
notification, and one that uses low-level socket primitives.
The overall conclusion is that, while the performance penalty
of using Globus primitives (especially notification) for
replica coordination can be significant, the OGSI model is
suitable for building highly-available services and it makes
the task of engineering such services easier.



1 Introduction 

A Grid infrastructure, being a collection of resources, is
prone to many kinds of failures: application crashes, hardware
faults, network partitions, and unplanned resource downtime.
Most Grid platforms have mechanisms for tolerating at least
some kinds of these failures. These mechanisms typically retry
failed executions, perhaps starting at a recent checkpoint
[13, 4]. Furthermore, the need for fault tolerance in Grid
infrastructures is well known; an overview of these techniques
and a unifying failure handling framework, called Grid
Workflow, is given in [12].

* Department of Computer Science and Engineering, University of
California, San Diego
† AT&T Research Lab

Recently, however, the emphasis in the Grid stan- 
dardization efforts has moved from a focus on sup- 
porting job execution to service-oriented architectures 
that can be used not only for the traditional resource- 
intensive scientific computation tasks, but also as a gen- 
eral distributed computing platform. Specifically, the 
recent GGF (Global Grid Forum) standards define Grid 
computing platforms as collections of Grid services [9]¹
that include services that provide the Grid computing 
infrastructure, e.g., scheduling, and monitoring, as well 
as services that comprise the Grid application itself. 
While the fault-tolerance issues have been extensively 
explored in the traditional job execution scenarios, the 
issue of handling failures in the Grid services model 
as represented by the Open Grid Services Infrastructure 
(OGSI) [10] and its Globus [8] toolkit 3 (GT3) imple- 
mentation, remains largely unexplored. Given that Grid 
services are expected to become the basis for commer- 
cial as well as scientific applications, such support is 
critical for wide-scale acceptance. 

¹ A new standard, the Web Service Resource Framework (WS-RF),
works on the convergence of web services and Grid services.

This paper addresses the problem of building highly
available Grid services by replicating a service on two or
more hosts. Making services highly available is not a new
research area: it has been a research topic for decades, and
there are commercial products for making non-Grid services
highly available. So, one might assume that some standard
approach could be used for Grid services. Which approach to
use, however, requires some study:

1. While there are currently only a few examples of Grid
services, a common theme is that such services are expected
to be stateful. This is in contrast to the closely related
technology of Web services, which are stateless in that any
changes to the service's state that must persist across
failures are recorded in a database. If the machine executing
a Web service fails, then the client can rebind to another
machine (perhaps via a load balancer) that can reference the
persistent state in the database². Commercial products, like
the Veritas cluster manager [16], also assume that a failed
server can be recovered by restarting it on another machine:
any persistent state is kept in files or a database. A service
that is stateful can have a lower latency (by avoiding writing
state to a database) and have a simpler client-server
protocol.

2. Extrapolating from existing services that are part of Grid
systems, we can expect some of the services to have
nondeterministic behaviors. A nondeterministic service by
itself offers no problems to a client. Maintaining the
consistency among replicas of a nondeterministic service is a
problem, though: two replicas may become inconsistent even
though they execute the same sequence of commands.

Nondeterminism can arise from many sources. At one end of the
spectrum, a service's methods may be explicitly
nondeterministic. For example, it is often better to randomly
choose a "good" resource than to deterministically choose the
"best" resource, either because it is computationally easier
to do or because doing so spreads the load among several
resources [3]. At the other end of the spectrum,
nondeterminism may arise from timing issues, such as when
exactly an external event (like an interrupt) occurs. For
example, a resource manager service may use a lease-like
mechanism to reclaim resources: if, after an interval of time,
the use of a resource is not reconfirmed, then the manager
automatically reclaims the resource. Consider a resource
management service with two replicas where a client requests a
resource of some class C. Let there be two resources c1 and c2
of class C, and c1 has been previously allocated. One replica
may get the client's request before the lease for c1 has
expired, and so allocate c2 to the client. The second replica
may instead get the request after the lease has expired and
allocate c1 to the client.

² Some state, like a "shopping basket", can be bound to an
individual server. Losing such state due to a server failure
isn't often seen as being critical.

3. To make service integration easier, Grid services 
are designed using very high-level protocols and 
services. All else being equal, one would not ex- 
pect that a service built on top of, say, SOAP will 
have the same latency as a service built directly 
on top of TCP. Using the OGSI functions to pro- 
vide for high service availability is appealing: one 
could provide it as a feature portable across any 
OGSI service. But, if the performance is much 
worse than the same feature implemented at a low 
level, then performance may outweigh the engi- 
neering appeal of a high-level implementation. 
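
The lease scenario in point 2 can be reproduced with a few lines of code. The following is only an illustrative sketch: the class `ResourceManager`, its `allocate` method, and the time values are hypothetical and are not part of OGSI or GT3. Two replicas start in the same state, receive the same request, and diverge only because the request reaches them at different times relative to the lease expiry.

```python
class ResourceManager:
    """Toy lease-based resource manager (hypothetical, for illustration)."""

    def __init__(self, resources, lease_length):
        self.lease_length = lease_length
        # resource name -> time at which its lease expires (None means free)
        self.lease_expiry = {name: None for name in resources}

    def allocate(self, now):
        """Return the first resource that is free or whose lease has expired."""
        for name in sorted(self.lease_expiry):
            expiry = self.lease_expiry[name]
            if expiry is None or expiry <= now:
                self.lease_expiry[name] = now + self.lease_length
                return name
        return None


def make_replica():
    manager = ResourceManager(["c1", "c2"], lease_length=10)
    manager.allocate(now=0)  # c1 was allocated earlier; its lease ends at t=10
    return manager


replica_a = make_replica()
replica_b = make_replica()

# The same client request reaches the replicas at slightly different times.
choice_a = replica_a.allocate(now=9)   # lease on c1 still valid
choice_b = replica_b.allocate(now=11)  # lease on c1 has expired
```

Both replicas executed the same sequence of commands, yet they hold different allocations afterwards, which is the inconsistency that rules out simply running every request on every replica.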

The first two points suggest using a primary-backup approach
for service availability. Point one requires replicas to be
consistent with each other. For example, a client could
interact with replica r1, and then start interacting with
another replica r2 if r1 fails. Anything that the client knows
from r1 about the state of the service should also be known by
r2. Point two is addressed by having only one replica, the
primary, respond to requests. The primary will keep one or
more backup services consistent and ready to take over should
the primary fail [6].

The primary goal of this paper is to evaluate the 
tradeoffs associated with using primary-backup as a 
fundamental technique for building highly available 
Grid services in the context of OGSI and Globus. Much 
of this focuses on the third point above — the tradeoff 
of performance versus use of facilities provided by the 
OGSI standard. We first designed a primary-backup 
protocol using OGSI to determine whether it supplies 
the necessary features, such as state update and client 
rebinding, and to see what changes might be needed to 
support such an approach. As described in Section 2, 
we found that it is not hard to accommodate primary- 
backup, and that the solution is simple and requires only 
small changes to the service to handle non-determinism. 
The use of the OGSI notification interface to handle 
replica updates is perhaps the key distinguishing feature 
of this approach. 

We then implemented this approach using GT3 
to better understand the performance implications and 
tradeoffs of doing primary-backup at such a high level. 
In particular, using a simple example Grid service, we 
compared the performance of this notification-based ap- 
proach to variants in which replica update is done using 
standard Grid service method calls and TCP, respec- 
tively. Our example Grid service implements a sim- 



var target // points to the current primary
var ops    // contains records of on-going operations

INIT_CLIENT() // called when client binds to a Grid service
    (replica1, replica2, ...replican) ← find_replicas();
    target ← replica1;
    ops ← ∅;
    for each host ∈ {replica1 ... replican} do
        register_notification(&FAILURE_HANDLER, host, FAILURE);
    stub.init();

OP(params)
    var op     // object for holding operation parameters
    var done   // a semaphore for waiting until success
    var result // object for holding results of executions
    op ← { params, &done, &result };
    ops ← ops ∪ &op; // add op to the list of on-going operations
    create_thread(&INVOKE_OP, &op);
    wait_on_semaphore(&done); // wait for someone to succeed in op
    ops ← ops \ &op; // remove op from the list
    return result;

INVOKE_OP(op)
    var result // object for holding results of executions
    result ← stub.op(op.params); // make the SOAP call
    if result ≠ failure then
        op.result ← result; // pass result back to OP()
        signal(op.done);    // wake it up

FAILURE_HANDLER(new_primary)
    target ← new_primary;
    for each op ∈ ops do
        create_thread(&INVOKE_OP, op);

Figure 2. Pseudocode for the fault-tolerant stub on the client.



failover duration, we don't wait for the stub.op() to return
(it may take a while for the TCP socket to time out), so we
re-submit the request to the new primary in FAILURE_HANDLER by
spawning another INVOKE_OP thread. The parameters for this
invocation are kept in the list ops. Eventually, some
invocation should succeed, allowing OP to wake up and return
the result.
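
The resubmission logic of Figure 2 can be sketched with ordinary threads. This is a minimal sketch under stated assumptions, not the GT3 stub: the class name `FaultTolerantStub` and the `call(replica, params)` callback are hypothetical, and a failed invocation is assumed to surface as a `ConnectionError` rather than a SOAP fault.

```python
import threading


class FaultTolerantStub:
    """Client-side stub that resubmits pending operations after a failover."""

    def __init__(self, replicas, call):
        # `call(replica, params)` is a hypothetical transport function that
        # returns a result or raises ConnectionError if the replica is down.
        self.call = call
        self.target = list(replicas)[0]  # current primary
        self.ops = []                    # records of on-going operations
        self.lock = threading.Lock()

    def op(self, params):
        record = {"params": params, "done": threading.Event(), "result": None}
        with self.lock:
            self.ops.append(record)
        self._spawn(record)
        record["done"].wait()            # wait for some invocation to succeed
        with self.lock:
            self.ops.remove(record)
        return record["result"]

    def _spawn(self, record):
        with self.lock:
            target = self.target
        thread = threading.Thread(
            target=self._invoke, args=(record, target), daemon=True)
        thread.start()

    def _invoke(self, record, target):
        try:
            result = self.call(target, record["params"])
        except ConnectionError:
            return                       # a later resubmission will succeed
        with self.lock:
            if not record["done"].is_set():
                record["result"] = result
                record["done"].set()     # wake up the caller blocked in op()

    def failure_handler(self, new_primary):
        # Called when a FAILURE notification names the new primary.
        with self.lock:
            self.target = new_primary
            pending = list(self.ops)
        for record in pending:
            self._spawn(record)          # do not wait for the old call to time out
```

The caller of `op` never sees the failed invocation: the thread that is still blocked on the old primary is simply abandoned, and the first thread to obtain a result releases the caller.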

The OGSI model specifies a method for clients to 
deal with failures of Grid service instances. Specifi- 
cally, each Grid service has a persistent handle called 
a GSH (Grid Service Handle) and this handle can be 
resolved into a handle, called a GSR (Grid Service Ref- 
erence), for an instance of this Grid service. The GSR 
may become invalid over time and the client can reac- 
quire a valid GSR by re-resolving its GSH. The han- 
dle resolution is performed by a Grid service called the 
Handle Resolution Service. Although our design could 
incorporate this approach, in this paper we use a design 
that by-passes the Handle Resolution Service for two 
reasons: 

• The Handle Resolution Service, if not fault- 
tolerant itself, would provide a single point of fail- 
ure that could make all Grid services that rely on it 
unavailable. 



var rate_sending // time interval for sending heartbeats

INIT_PRIMARY()
    claim_notification_source(HEARTBEAT); // register as source
    claim_notification_source(STATE_UPDATE);
    schedule(&HEARTBEAT_GENERATOR, rate_sending); // run regularly

HEARTBEAT_GENERATOR()
    notify_change(HEARTBEAT); // send a notification

EXECUTE(request)
    var result // object for holding results of executions
    result ← check_previous_requests(request);
    // if request is completed result is not NULL, but for new requests it is
    if result = NULL then
        var state // encoding of application state
        result ← service.op(request.params); // execute the request
        state ← service.extract_state(); // obtain state of the application
        notify_change_with_ack(STATE_UPDATE, state); // waits for acks
    return result;

Figure 3. Pseudocode for the primary service.



• In the handle resolution approach, the client only
detects the failure of the primary when it attempts
to use its GSR. In our approach, the client is notified
immediately.

The code on the replicas is interposed between the 
Grid infrastructure and the service implementation. For 
each client OP there is an implementation of that oper- 
ation on the server. To make stateful primary-backup 
replication possible, the service must implement two 
additional methods for state transfer: extract_state()
and inject_state(). Ideally, state transfer can be done
by a small set of values describing all the relevant
application state, but in the extreme it could be a full
application checkpoint.

On the primary, as shown in Figure 3, we intercept each one
of the operations with the EXECUTE method. It first checks
whether this request has already been processed; this can
happen when a server crashes after sending the state to the
backups, but before replying to the client. In that case the
old result is returned without executing the request.
Otherwise, the request is executed, followed by the extraction
of state, which is sent to backups via notifications. Note
that notify_change_with_ack blocks until it gets an
acknowledgment from every backup. In the initialization
routine, the primary advertises itself as a source of two
types of notifications (HEARTBEAT and STATE_UPDATE) and
schedules a heartbeat routine to run regularly.
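
The EXECUTE path can be sketched in a few lines. This is a sketch under assumptions rather than the GT3 implementation: `CounterService`, `Backup.state_update`, and the request identifiers are hypothetical, notifications are modelled as direct method calls, and the table of completed requests is assumed to travel with each state update so that a promoted backup can recognise a resubmitted request.

```python
class CounterService:
    """Toy stateful service (hypothetical): adds amounts to a running total."""

    def __init__(self):
        self.total = 0

    def op(self, amount):
        self.total += amount
        return self.total

    def extract_state(self):
        return {"total": self.total}

    def inject_state(self, state):
        self.total = state["total"]


class Backup:
    def __init__(self, service):
        self.service = service
        self.completed = {}

    def state_update(self, state, completed):
        self.service.inject_state(state)
        self.completed = dict(completed)
        return True  # acknowledgment sent back to the primary


class Primary:
    def __init__(self, service, backups, completed=None):
        self.service = service
        self.backups = backups
        self.completed = dict(completed or {})  # request id -> result

    def execute(self, request_id, params):
        # A request seen before is answered from the record, not re-executed.
        if request_id in self.completed:
            return self.completed[request_id]
        result = self.service.op(params)
        self.completed[request_id] = result
        state = self.service.extract_state()
        self.notify_change_with_ack(state)      # blocks until all backups ack
        return result

    def notify_change_with_ack(self, state):
        for backup in self.backups:
            if not backup.state_update(state, self.completed):
                raise RuntimeError("backup did not acknowledge the update")
```

If the primary crashes after the state update but before replying, the client resubmits the request to the promoted backup, which returns the recorded result instead of applying the operation a second time.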

Figure 4 shows the pseudocode for backups. During normal
operation they receive two kinds of notifications: their
STATE_HANDLER receives state updates and injects the state
into the service application, and their HB_HANDLER receives
heartbeats. Both store the current timestamp in the global
variable last_notification. FAILURE_DETECTOR checks this
variable to make sure it is not stale. If it is, then the
primary is assumed to have failed.

var rate_checking     // time interval for checking for notifications
var last_notification // timestamp of the last notification
var primary_is_up     // boolean flag
var senior            // this is the senior backup

INIT_BACKUP()
    (replica1, replica2, ...replican) ← find_replicas();
    primary_is_up ← TRUE;
    if my_url() = replica2 then
        senior ← TRUE;
    register_notification(&HB_HANDLER, replica1, HEARTBEAT); // register sinks with primary
    register_notification(&STATE_HANDLER, replica1, STATE_UPDATE);
    claim_notification_source(FAILURE); // register as a source for clients
    schedule(&FAILURE_DETECTOR, rate_checking);
    SETUP_SENIOR(replica2);

SETUP_SENIOR(senior_url)
    if senior = TRUE then
        claim_notification_source(HEARTBEAT);
        claim_notification_source(STATE_UPDATE);
    else
        register_notification(&HB_HANDLER, senior_url, HEARTBEAT);
        register_notification(&STATE_HANDLER, senior_url, STATE_UPDATE);

FAILURE_DETECTOR()
    if (current_time() − last_notification) > rate_checking then
        if senior = TRUE then
            switch_to_primary();
            notify_change(FAILURE); // notify client
        else
            if primary_is_up = TRUE then
                primary_is_up ← FALSE; // wait for the next timeout
            else
                INIT_BACKUP(); // backups re-initialize, electing a new primary
    else
        if primary_is_up = FALSE then
            primary_is_up ← TRUE;
            (replica1, replica2, ...replican) ← find_replicas(); // elect new senior
            SETUP_SENIOR(replica2);

STATE_HANDLER(state)
    service.inject_state(state);
    last_notification ← current_time();

HB_HANDLER()
    last_notification ← current_time();

Figure 4. Pseudocode for the backup service

Switching to a new primary can take a long time be- 
cause it needs to register as a source of notifications and 
all backups must re-bind to the new primary. If we de- 
layed client-bound failover notification until re-binding 
is complete, the failover time of our system would be 
extremely large (binding can take seconds!). We avoid 
this performance penalty by binding all backups to one 
special backup, which we call the senior backup, at the 
time of service initialization. 

If a failure is detected, the senior backup becomes the
primary and notifies the client immediately, since it already
has all backups registered with it to receive state updates
and heartbeats. The remaining backups then choose a new senior
and bind to it "off-line", without delaying processing of
client requests. This binding is implemented in the
SETUP_SENIOR method, which is called during initialization and
during recovery. In the rare situation that the senior backup
fails together with the primary, all surviving backups will
assume the failure of the senior after the second missing
heartbeat and they will go through full re-initialization by
calling INIT_BACKUP.
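
The decision logic of the failure detector can be written as a small state machine. The sketch below is illustrative only: the class `BackupMonitor` and the action strings are hypothetical, time is passed in explicitly so that the behaviour is deterministic, and `timeout` plays the role of the checking interval in Figure 4.

```python
class BackupMonitor:
    """Failure-detector state of one backup (hypothetical sketch)."""

    def __init__(self, is_senior, timeout):
        self.is_senior = is_senior
        self.timeout = timeout
        self.last_notification = 0.0
        self.primary_is_up = True

    def on_notification(self, now):
        # Both heartbeats and state updates refresh the timestamp.
        self.last_notification = now

    def check(self, now):
        """One run of the failure detector; returns the action taken."""
        if now - self.last_notification > self.timeout:
            if self.is_senior:
                return "become_primary_and_notify_clients"
            if self.primary_is_up:
                self.primary_is_up = False
                return "wait_for_next_timeout"
            return "reinitialize"        # senior failed too
        if not self.primary_is_up:
            self.primary_is_up = True
            return "bind_to_new_senior"  # the old senior took over
        return "ok"
```

A junior backup therefore tolerates one missed interval, giving the senior time to take over, and falls back to full re-initialization only when the second interval also passes without a notification.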

For simplicity, the pseudocode shows that the state is
applied immediately by calling inject_state in STATE_HANDLER.
In a real implementation it would be better to queue up the
state update, send back an acknowledgment and apply the state
later, so as to impose as small a penalty on the response time
as possible. As implemented, the protocol queues state updates
and applies them later in this way. Doing so can slow down
failover because the backup may have to apply queued state
messages before processing new requests.
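
The queue-then-apply variant can be sketched as follows. This is an illustrative sketch, not the implemented protocol: the class `QueuedBackup` and its method names are hypothetical, and applying a state is modelled as storing it.

```python
from collections import deque


class QueuedBackup:
    """Backup that acknowledges state updates before applying them."""

    def __init__(self):
        self.pending = deque()  # state updates received but not yet applied
        self.state = None       # last state applied to the service

    def state_handler(self, state):
        # Enqueue and acknowledge at once, so the primary is not delayed
        # by the cost of applying the state.
        self.pending.append(state)
        return "ack"

    def apply_pending(self):
        # Apply queued updates in arrival order.
        while self.pending:
            self.state = self.pending.popleft()

    def take_over(self):
        # On failover the queue must be drained before serving requests,
        # which is the source of the extra failover delay.
        self.apply_pending()
        return self.state
```

The trade is explicit in `take_over`: the time saved on every acknowledgment is paid back once, at failover, in proportion to the number of updates still queued.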
3 Performance 

While it appears that OGSI is a suitable platform 
for building primary-backup fault tolerance, the over- 
head of replication may ultimately determine whether 
the technique is useful in practice. In this section, we 
describe the performance of our prototype implementa- 
tion using GT3. 

Our example highly available Grid service is a sim- 
plified version of the well-known Condor Matchmaker 
service. We measured the transfer overhead, the request 
response time, and the failure notification overhead of a 
prototype service structured according to the primary-backup
approach described above. We performed experiments on a pair
of dual-CPU Pentium II 300 MHz workstations with 400 MB of
memory, running Linux
2.4. We only considered a system with a primary and 
one backup, since this is by far the most common way 
primary-backup is used. 

3.1 Matchmaker Grid Service 

We designed the Grid Matchmaker service based 
on existing (but more complex) non-Grid services, such 
as Condor Matchmaker [14], Java Market [2], and the 
resource management tools in Globus [7]. We chose to 
use this service because it is an example of an important 
class of Grid service, and because it is inherently
nondeterministic.

Our Matchmaker service keeps track of machines 
available in the Grid, accepts requests for allocating ma- 
chines, and maps each request to a suitable machine. 
There are two kinds of requests: one is a resourceAdvertise
request, and the other is a jobSubmit request. A
resourceAdvertise request provides information about
a machine that is available for allocation. The input 
of this request is: the resource ID, the available CPU 
speed, the available memory size, the available disk 
size, the machine's IP address, and an identification 
string used to implement a simple capability for using 
the machine. A jobSubmit request sends a specification 
for a desired machine. If there are suitable machines 
available, then the Matchmaker service will choose one 
and send the address of this machine and the identifica- 
tion string back to the client. The input of this request is: 
the job ID, the required CPU speed, the required mem- 
ory size, the required disk size, the priority of the job. 
The response of this request is the address and identity 
of the chosen machine, if there is one available; other- 
wise, the request returns a null string. 
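
The two request types can be sketched as a small in-memory service. This is a sketch under assumptions, not the service measured here: the names `Machine`, `Matchmaker`, `resource_advertise`, and `job_submit` are hypothetical, the units are arbitrary, a matched machine is assumed to be removed from the pool, and the job priority is accepted but not used because this sketch has no request queue.

```python
import random
from dataclasses import dataclass


@dataclass
class Machine:
    resource_id: str
    cpu_mhz: int
    memory_mb: int
    disk_mb: int
    ip_address: str
    ident: str  # identification string acting as a simple capability


class Matchmaker:
    def __init__(self, rng=None):
        self.available = {}               # resource id -> Machine
        self.rng = rng or random.Random()

    def resource_advertise(self, machine):
        self.available[machine.resource_id] = machine

    def job_submit(self, job_id, cpu_mhz, memory_mb, disk_mb, priority=0):
        suitable = [m for m in self.available.values()
                    if m.cpu_mhz >= cpu_mhz
                    and m.memory_mb >= memory_mb
                    and m.disk_mb >= disk_mb]
        if not suitable:
            return ""                     # null string: nothing available
        chosen = self.rng.choice(suitable)  # any suitable machine will do
        del self.available[chosen.resource_id]
        return f"{chosen.ip_address} {chosen.ident}"
```

The random choice among suitable machines is deliberate: it is one of the two sources of nondeterminism in the service.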



This Matchmaker service is non-deterministic for two
reasons. First, if there are several machines that satisfy a
jobSubmit request, then the machine that is allocated can be
nondeterministically chosen. Second, the Matchmaker service is
implemented by two threads: one enqueues requests and one
executes enqueued requests. Requests are enqueued in priority
order, and are FIFO within each priority. Two servers S1 and
S2 could behave differently because of these rules on
priority. Consider two jobSubmit requests: r_h is a high
priority request and r_l is a low priority request. Let r_l
arrive at the servers before r_h. If server S1 is slower than
S2, then r_l may arrive at S1 when it is busy and arrive at S2
when it is not busy. If r_h arrives shortly thereafter, then
S1 will execute r_h before r_l and S2 will execute r_l before
r_h.
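
The ordering difference between the two servers can be reproduced with a priority queue. This is only an illustrative simulation: the function `execution_order`, the arrival times, and the service time are hypothetical values chosen to match the scenario described above.

```python
import heapq


def execution_order(arrivals, busy_until, service_time=5):
    """Return the order in which a server executes requests.

    arrivals: list of (arrival_time, name, priority); higher priority first,
    FIFO within a priority. The server is busy with earlier work until
    `busy_until`, then executes one queued request at a time.
    """
    pending = sorted(arrivals)
    queue, order = [], []
    clock, i = busy_until, 0
    while i < len(pending) or queue:
        # Enqueue everything that has arrived by the current time.
        while i < len(pending) and pending[i][0] <= clock:
            _, name, priority = pending[i]
            heapq.heappush(queue, (-priority, i, name))
            i += 1
        if queue:
            _, _, name = heapq.heappop(queue)
            order.append(name)
            clock += service_time
        else:
            clock = pending[i][0]         # idle until the next arrival
    return order


requests = [(1, "r_l", 0), (2, "r_h", 1)]  # r_l arrives first
order_s1 = execution_order(requests, busy_until=3)  # slower server, still busy
order_s2 = execution_order(requests, busy_until=0)  # idle server
```

On the busy server both requests are queued before either runs, so priority decides; on the idle server the low priority request is already executing when the high priority one arrives.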

3.2 State transfer 

To fully understand the sources of overhead in state 
transfer, we compared the implementation using OGSI 
notifications for state updates (labeled Notification) to 
two alternative implementations, one that uses direct 
Grid service method calls (labeled Call) and one that 
uses TCP connections (labeled Socket). In the following 
tables we present the median, the mean, and the stan- 
dard deviation for a set of 20 round-trip measurements. 
To better understand the overhead of state updates, our 
service can be configured to send an arbitrary amount 
of data in each state update. 

First, Table 1 shows the round-trip time of a single state
update, for a number of different state sizes (from 10 bytes
to 100 kilobytes), as measured on the primary. Not
surprisingly, Socket always has the smallest round-trip time,
but this advantage goes from around 200 times faster for 10 B
updates to only 1.4 times faster when the state size is
100 kB. Call has intermediate round-trip times: at 10 B, it is
about 20 times slower than Socket, while at 10 kB it is only
2.5 times slower.
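
The ratios quoted above follow directly from the medians reported in Table 1. The short calculation below reproduces them; the dictionary layout and the function name `slowdown_vs_socket` are illustrative choices, while the numbers are the Table 1 medians in milliseconds.

```python
# Median state transfer round-trip times from Table 1 (milliseconds).
MEDIAN_MS = {
    "Notification": {"10 B": 192.5, "100 B": 189.0, "1 kB": 193.0,
                     "10 kB": 195.5, "100 kB": 190.0},
    "Call":         {"10 B": 19.5, "100 B": 19.0, "1 kB": 26.5,
                     "10 kB": 30.0, "100 kB": 209.0},
    "Socket":       {"10 B": 1.0, "100 B": 2.0, "1 kB": 2.5,
                     "10 kB": 12.0, "100 kB": 133.0},
}


def slowdown_vs_socket(method, size):
    """How many times slower `method` is than Socket at a given state size."""
    return MEDIAN_MS[method][size] / MEDIAN_MS["Socket"][size]


notification_small = slowdown_vs_socket("Notification", "10 B")
notification_large = slowdown_vs_socket("Notification", "100 kB")
call_small = slowdown_vs_socket("Call", "10 B")
call_10kb = slowdown_vs_socket("Call", "10 kB")
```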

Note that the round-trip times for Notification are 
mostly insensitive to the size of the state update. Es- 
sentially, the cost of sending 10 bytes and 100 kilobytes 
with a notification is roughly the same. We think the 
cause of this lies in the format of GT3 notification mes- 
sages; this is something that might be worth examining 
for later versions of the toolkit. 

We also observed very high variance in samples: 
the standard deviation is sometimes higher than 50% 
and in one case is larger than the mean. This last case is 
due to a single outlier in the Call experiment for 100 kB 
of state update, which took 1.7 seconds. We think that 
much of this large variance is an artifact of Java garbage 
collection or other background processing in the Java 



Table 1. State transfer round-trip time (milliseconds)

                          10 B    100 B     1 kB    10 kB   100 kB
Notification  Median     192.5    189.0    193.0    195.5    190.0
              Mean       201.3    191.8    196.7    199.8    192.7
              St. Dev.    29.8     13.0     15.5     23.6     14.8
Call          Median      19.5     19.0     26.5     30.0    209.0
              Mean        26.4     24.2     33.1     32.6    299.0
              St. Dev.    12.0      9.2     17.9      6.3    333.0
Socket        Median       1.0      2.0      2.5     12.0    133.0
              Mean         1.5      1.7      2.5     14.8    144.5
              St. Dev.     0.8      0.5      0.5     11.9     32.4



Table 2. Client request round-trip time (milliseconds)

                          10 B    100 B     1 kB    10 kB   100 kB
Notification  Median     242.0    241.0    240.0    238.5    232.0
              Mean       252.3    247.4    251.2    301.1    241.2
              St. Dev.    41.9     36.5     36.6    257.9     34.2
Call          Median      65.0     72.0     76.5     74.0    261.0
              Mean        71.4     78.0     89.7     82.5    350.8
              St. Dev.    21.9     28.9     40.9     21.9    333.5
Socket        Median      45.5     47.0     48.5     55.5    182.0
              Mean        52.8     56.5     55.5     62.6    195.9
              St. Dev.    21.6     26.6     21.8     22.9     50.4



virtual machine or the Grid container. 

Table 2 shows round-trip times of client requests 
during normal, failure-free operation, as measured by 
the client. We would expect these numbers to be, ap- 
proximately, the sum of the request round-trip time 
without primary-backup replication plus the state up- 
date overhead shown in Table 1 above. That is, indeed, 
the case since the request round-trip time of a normal 
Grid service, without replication, is 54.3 ms on average 
with the median being 44 ms. The data from the pre- 
vious two tables is summarized graphically in Figure 5, 
where client request round-trip time is broken down into 
interaction between the client and the primary (white) 
and the interaction between the primary and the backup 
(solid, upward diagonal, and downward diagonal). 

In Table 3, we normalized the data of Table 2 by dividing
the median and mean numbers by the median and mean of the
normal Grid service round-trip. So, each number shows the
magnitude of overhead imposed by replication. The table shows
that with Socket the median overhead of replication is small:
for small state sizes (up to 10 kB) it is 30% or less. With
Call, the median overhead is 70% or less for small state
sizes. For large state sizes all approaches perform similarly,
with overheads of 400% and more.

From these results, we conclude that notifications 
are considerably less efficient than socket messages and 
service calls for small state sizes. For larger state sizes 
all of the three approaches impose a high overhead. 
Note that in all cases the requests have very low over- 
heads. In this situation a request that used to take 44 ms 
ends up taking between 4 and 6.5 times as long with 
replication. For Grid services that have longer-running 
requests, the overhead of replication will be diminished. 
For example, for a request that takes 3 seconds to execute
and has state size of 100 kB, the overhead of replication
is less than 10%. Hence, the drawback of using
GT3 to implement primary-backup becomes negligible 
for long-running requests. 
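
Both the normalization behind Table 3 and the estimate for a long-running request are simple arithmetic on the measurements above. The sketch below works them through; the function names are illustrative, and the 3-second case assumes that replication adds the same absolute delay regardless of how long the request itself runs.

```python
# Round-trip time of a normal Grid service without replication (ms).
BASELINE_MS = {"median": 44.0, "mean": 54.3}

# Median client round-trip times at 100 kB of state, from Table 2 (ms).
MEDIAN_100KB_MS = {"Notification": 232.0, "Call": 261.0, "Socket": 182.0}


def ratio(replicated_ms, baseline_ms):
    """Entry of Table 3: replicated round-trip divided by the baseline."""
    return replicated_ms / baseline_ms


def relative_overhead(replicated_ms, baseline_ms, request_ms):
    """Delay added by replication as a fraction of a request of length request_ms."""
    return (replicated_ms - baseline_ms) / request_ms


# Table 3, Notification, 10 B, median: 242.0 ms from Table 2 over 44 ms.
notification_ratio = ratio(242.0, BASELINE_MS["median"])

# A 3-second request with 100 kB of state, for each update mechanism.
long_request = {
    method: relative_overhead(ms, BASELINE_MS["median"], 3000.0)
    for method, ms in MEDIAN_100KB_MS.items()
}
```

Under that assumption the added delay stays below a tenth of the request time for all three mechanisms, which is the basis for the claim about long-running requests.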

3.3 Failover 

Another important metric for the performance of a 
fault-tolerant system is failover duration. This is a sum 
of two quantities: the time it takes for the backup to de- 
tect the failure, and the time it takes for the backup to 
notify the clients of a failover. The first quantity
depends on the frequency of heartbeat messages and is

Figure 5. Median round-trip times for state transfer using Socket, Call and Notification with
different state sizes. The white portion is the round-trip time of a client request without replication.
Therefore, the overall size of each bar is the total client round-trip time.



Table 3. Ratio of client request round-trip with primary-backup to median/mean client round-trip
without replication

                        10 B   100 B    1 kB   10 kB  100 kB
Notification  Median     5.5     5.5     5.5     5.4     5.3
              Mean       4.6     4.6     4.6     5.5     4.4
Call          Median     1.5     1.6     1.7     1.7     5.9
              Mean       1.3     1.4     1.7     1.5     6.5
Socket        Median     1.0     1.1     1.1     1.3     4.1
              Mean       1.0     1.0     1.0     1.2     3.6



Table 4. Failover duration (milliseconds)

Clients       1     2     4     8    16
Median      189   215   524   896  1989
Mean        194   302   547  1018  2269
St. Dev.     21   352    83   454   666



largely independent of the implementation. Therefore, 
we only measure the second quantity, as shown in Ta- 
ble 4. 

With one client, it takes 194 ms on average to notify
the client of a failure. As the number of clients
increases, the notification overhead increases linearly.
In [17] we report that recovering a network connection
endpoint in less than 200 ms requires significant investment
in equipment for logging of packets that may be
lost due to the failure, and so we believe that 194 ms is
quite acceptable. If the number of clients that share the
same instance is large, however, then the overhead may
become too large. Again, these results were obtained
based on the current implementation of GT3. A later version
may be able to have notifications run faster than
linear in the number of sinks.
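
As a rough check on the linear-growth observation, the mean failover durations in Table 4 can be fitted with an ordinary least-squares line. The sketch below is only illustrative: the helper name `fit_line` is ours, and the fit says nothing about client counts beyond the five that were measured.

```python
# Mean failover duration (ms) versus number of clients, copied from Table 4.
clients = [1, 2, 4, 8, 16]
mean_ms = [194, 302, 547, 1018, 2269]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(clients, mean_ms)
print(f"about {slope:.0f} ms of notification time per additional client")
```

The fitted slope comes out at roughly 140 ms per additional client, which is consistent with notifications being delivered one sink at a time.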

Note that if the client doesn't have outstanding requests
to the primary service when the failure happens,
then the overhead of the failover at the client is almost zero,
since the client only needs to change the address of the
service invoked.

4 Conclusion 

Fault tolerance of stateful Grid services is becom- 
ing increasingly important with the development and 
use of OGSI. Both the infrastructure services such as 
monitoring, resource allocation, and scheduling, as well 
as Grid applications implemented as Grid services, are 
required to be reliable and highly available. In this pa- 
per, we showed that the facilities defined in OGSI and 
the newly proposed WS-Notification extension to Web 
services [11] can be used to design a primary-backup 
service. While not described in this paper, this ser- 
vice can be easily extended to multiple backups and 



plified version of the Condor Matchmaker service [14]. 
Nondeterminism arises in this service both from the 
way resources are selected and from priorities. 

Section 3 gives the performance results. We found
the performance penalty was, in fact, quite high. While
some of this may result from the lack of performance 
tuning in GT3, we believe that our findings also have 
larger implications related to how and where replication 
should be used to provide fault tolerance in Grid service 
architectures. 

We do not consider client failure in this paper. One 
of the attractions of the primary-backup approach is 
that it defines a very simple client-server protocol that 
does not depend on clients being reliable. In other 
words, the correctness of the server, in terms of how 
it responds to requests, does not depend on help from 
the clients, which means that client failures can be dealt 
with using orthogonal approaches such as timeouts and 
leases [15]. We also do not consider software bugs that 
can lead to completely correlated failures. In this case, 
the primary and all backups could simultaneously crash. 
Again, there are separate techniques that are used in 
practice for tolerating such failures. 

2 Architecture 

We first describe the primary-backup approach to 
replication. We then cover the concepts behind Grid 
services, and then give a design of a primary-backup 
service on top of OGSI. 

2.1 Primary-Backup replication 

Primary-backup is a well-known technique for 
making services highly available [1, 5, 6]. A client 
sends a request to the primary, which receives and exe- 
cutes the request. The primary then sends a state update 
message to the backups and replies to the client. Typ- 
ically, the primary does not reply to the client until it 
knows that all backups have received the state update. 
This is done to ensure that the backups are always con- 
sistent with the client: it is impossible for the client to 
know that the primary executed the request without the 
backups also knowing this. Figure 1 shows a space-time 
diagram of the execution of a simple primary-backup 
protocol. 
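
The request flow described above can be sketched in a few lines. This is a minimal in-process illustration, not the authors' implementation: the class names `Primary` and `Backup`, the dictionary-based state, and the synchronous acknowledgment are all assumptions made for the example.

```python
class Backup:
    """Holds a copy of the service state and acknowledges each update."""

    def __init__(self):
        self.state = {}

    def apply_update(self, update):
        self.state.update(update)
        return True  # acknowledgment back to the primary


class Primary:
    def __init__(self, backups):
        self.state = {}
        self.backups = backups

    def handle(self, key, value):
        # 1. Execute the request locally.
        self.state[key] = value
        update = {key: value}
        # 2. Send the state update to every backup and wait for all acks.
        acks = [b.apply_update(update) for b in self.backups]
        if not all(acks):
            raise RuntimeError("a backup did not acknowledge the update")
        # 3. Only now reply to the client.
        return f"ok: {key}={value}"


backups = [Backup(), Backup()]
primary = Primary(backups)
reply = primary.handle("job-42", "scheduled")
```

Because the reply is produced only after step 2, a client that has seen the reply can rely on every backup holding the same update.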

Primary-backup requires that a client be notified 
that the primary has failed and allow the client to re- 
bind to the newly-appointed primary. Ideally, this abil- 
ity should be available below the level of the service re- 
quest: doing so allows a client designed to interact with 
a single non-replicated server to be transparently ported 
to interact with a primary-backup service.

Figure 1. Primary-Backup Protocol

Primary-backup has a few drawbacks. First, it is
best suited for tolerating benign failures such as crashes 
and message loss, rather than arbitrary or malicious fail- 
ures. Unless malicious failures are a concern in a spe- 
cific Grid environment, we consider masking only be- 
nign failures a worthwhile tradeoff for the ability to 
run non-deterministic applications. Another drawback 
is that primary-backup requires that the environment is 
synchronous enough to support the use of heartbeats to 
detect failures. In practice, this means that the primary 
and the backups need to run on a cluster that is man- 
aged by some failure detector and recovery manager. 
Commercial products, such as the VERITAS Cluster 
Server [16], can be used for this purpose. 

2.2 Grid Services 

Grid services in OGSI combine Grid technologies 
with Web services to provide a platform for building 
distributed applications. Unlike Web services, Grid services
are stateful and may be short-lived. The OGSI
model allows each client to choose among several avail- 
able instances of a service or create its own instance. 
The instances may have a limited lifetime since re- 
sources can be created to serve certain clients and are 
removed after they are no longer needed. 

Interaction between a Grid service and a client hap- 
pens in a request-reply fashion using strictly-defined in- 
terfaces and a certain encoding of data (interfaces are 
described by WSDL and the messages are encoded us- 
ing SOAP, both of which are XML-based). In addi- 
tion to regular requests and replies, Grid instances may 
subscribe using an OGSI-specified interface to notifi- 
cations, which asynchronously alert subscribers (called 
sinks in OGSI terminology) of state changes. Using no- 
tifications avoids wasteful polling. 

Notifications can be of two types: a push notifi- 
cation sends information along with the notification, 
whereas a pull notification is used to indicate that some- 
thing has changed: it is up to the notification subscriber 
to request (or pull) the information using a regular re- 
quest. Pull notification gives the subscriber the freedom 
to decide whether and when to get information associated
with the notification, while push notification avoids
the overhead of that additional call in situations where 
the information is needed immediately. 
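
The difference between the two notification types can be illustrated with plain callbacks. This is a sketch under our own assumptions: the `Source` class and its `push_sinks` and `pull_sinks` lists are hypothetical and do not correspond to the OGSI notification interfaces.

```python
class Source:
    """A service instance whose state changes are announced to sinks."""

    def __init__(self):
        self.value = None
        self.push_sinks = []
        self.pull_sinks = []

    def get_value(self):
        # Regular request used by pull subscribers to fetch the data.
        return self.value

    def set_value(self, value):
        self.value = value
        for sink in self.push_sinks:
            sink(value)      # push: the data travels with the notification
        for sink in self.pull_sinks:
            sink()           # pull: only "something changed" is signalled


source = Source()
pushed, changed = [], []
source.push_sinks.append(pushed.append)
source.pull_sinks.append(lambda: changed.append(True))

source.set_value("state-1")
source.set_value("state-2")

# The pull subscriber decides when to fetch; here it fetches once, after both changes.
pulled = source.get_value() if changed else None
```

The push sink sees every intermediate value, while the pull sink pays for one extra request and may observe only the latest state.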

2.3 Primary-backup for Grid Services 

There are three general problems that any imple- 
mentation of a primary-backup mechanism needs to 
solve: 

1. Transfer of application state. Before replying to
the client, the primary needs to send the change in 
its state to the backups. A reply can be sent to the 
client only when it is known that the backups will 
eventually apply the state change. 

2. Detecting failures. Crashes and lost messages need 
to be detected. This is normally done by setting a 
timeout for every message. If no messages are sent 
for a long time, then a heartbeat message can be 
sent to check on a machine. 

3. Switching to a new primary. Originally, one of the 
service instances is designated as a primary and 
others as backups. After a failure of the primary, 
the backups agree on a new primary and ensure 
that all future requests are directed to it. 

Grid service notifications are a natural mechanism 
for solving all three of these problems because state up- 
dates and failures are inherently asynchronous events. 
Also, notifications provide a simple mechanism for 
disseminating information to a number of interested 
parties — several backups may be interested in the same 
state update, and several clients may be interested in 
the same failure notification. Consequently, in our sys- 
tem, backups register with the primary as sinks for 
state update notifications and heartbeat notifications, 
and each client registers with each backup as a sink for 
the failover notification that tells it to switch to a differ- 
ent primary and to resend the last request if it was ex- 
pecting a reply. We use a push notification for the state 
transfer because the backup needs every state update. 
For heartbeat and failover, pull notifications are used 
because there is no data associated with those events. 

The normal execution proceeds as follows. A 
client makes a Grid service request to the primary, 
which executes the request. When execution ends, the 
change in the state of the service is extracted and sent 
to the backups via a notification. When the primary 
collects acknowledgments from all backups it replies 
to the client. The state extraction and injection are 
application-specific: the Grid service needs to support
methods that allow this to be done. In addition, the service
can be designed to have the primary send checkpoints
to the backups if its computation is long-running.

Failure of the primary is detected by backups when 
they do not receive a heartbeat message after a certain 
period of time. This method allows detection of host 
and task crashes, as well as network partitions. At that 
point, the backups need to cooperate in election of the 
new primary. The newly elected primary then sends a 
failover notification to the client so it can obtain a new 
server instance handle. If the client was expecting a re- 
ply from the service when a failover notification arrives, 
then the client resubmits the request to the new service 
instance. If the old primary had already sent a state up- 
date to the new primary, then the new primary can reply 
with the result computed by the old primary. Otherwise, 
it can compute the result itself (perhaps starting from a 
checkpoint if the primary had sent checkpoints to the 
backups). 
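
Timeout-based detection of a failed primary can be sketched as follows. The text does not specify how the backups elect a new primary, so the lowest-identifier rule in `elect_primary` is purely an assumption for illustration, as are the class name `BackupMonitor` and the use of explicit timestamps in place of a real clock.

```python
class BackupMonitor:
    """Backup-side view of the primary's heartbeats (times in seconds)."""

    def __init__(self, backup_id, timeout):
        self.backup_id = backup_id
        self.timeout = timeout
        self.last_heartbeat = 0.0

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def primary_suspected(self, now):
        # No heartbeat within the timeout: treat the primary as failed.
        return now - self.last_heartbeat > self.timeout


def elect_primary(backups):
    # Assumed rule, not taken from the paper: the lowest id wins.
    return min(b.backup_id for b in backups)


backups = [BackupMonitor(backup_id=i, timeout=1.0) for i in (2, 1, 3)]
for b in backups:
    b.on_heartbeat(now=10.0)

suspected_early = all(b.primary_suspected(now=10.5) for b in backups)
suspected_late = all(b.primary_suspected(now=11.5) for b in backups)
new_primary = elect_primary(backups) if suspected_late else None
```

The timeout directly bounds the detection part of the failover duration, which is why that part depends on the heartbeat frequency rather than on the implementation.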

Failures of the backups do not interfere with the 
operation of the surviving system components, so the 
only new issues are the detection of backup failures and 
the integration of new backups into the system. Neither 
is conceptually hard to implement, although integration 
of a new backup may require a large amount of state to 
be transferred. The details of how to best do this are 
outside the scope of this paper. 

2.4 Implementation details 

In this section we give a more detailed overview of 
our system using pseudocode to illustrate key actions 
performed by each of the three participants: a client, a 
primary, and a backup. Each one is enclosed in an ob- 
ject with private variables and methods. Note that we 
use C language convention for pointers: &x is a refer- 
ence to variable x. 

The client code, shown in Figure 2, is interposed 
between the client application and the original SOAP 
stub in such a way that client code is not changed. The 
original stub supports the INIT method, which is called 
when the client binds to a Grid service, and a number 
of operations, shown here collectively as OP. We in- 
tercept INIT with the INIT-CLIENT method to register 
for receipt of failover notifications from each backup. 
The OP method spawns a separate thread, implemented
by INVOKE_OP, to invoke the operation via the original
stub.
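
The interposition idea can be sketched as a thin wrapper around the original stub. This is not a transcription of Figure 2: the `Stub` class stands in for the generated SOAP stub, and the names `FailoverClient`, `failure_handler`, and `_invoke_op` are hypothetical.

```python
import threading


class Stub:
    """Stand-in for the original SOAP stub (hypothetical)."""

    def __init__(self, alive=True):
        self.alive = alive

    def op(self, request):
        if not self.alive:
            raise ConnectionError("primary unreachable")
        return f"done: {request}"


class FailoverClient:
    def __init__(self, stub):
        self.stub = stub
        self.result = None
        self.error = None

    def failure_handler(self, new_stub):
        # Called on a failover notification: rebind to the new primary.
        self.stub = new_stub

    def _invoke_op(self, request):
        try:
            self.result = self.stub.op(request)
            self.error = None
        except ConnectionError as exc:
            self.error = exc

    def op(self, request):
        # Invoke the operation in a separate thread, as the text describes.
        worker = threading.Thread(target=self._invoke_op, args=(request,))
        worker.start()
        worker.join()
        return self.error is None


client = FailoverClient(Stub(alive=False))
first_ok = client.op("submit")            # primary has crashed
client.failure_handler(Stub(alive=True))  # failover notification arrives
second_ok = client.op("submit")           # request is resubmitted
```

The application code calls `op` exactly as it would on the original stub; only the wrapper knows that the service behind it was replaced.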

If the primary crashes during this invocation, two
things will happen in arbitrary order: the call to
stub.op() will return an error message, and a failure
notification will arrive from the backup, causing the
FAILURE-HANDLER method to execute. To reduce the



to dynamically adding backups. In addition, by us- 
ing slightly modified client stubs, failover can be done 
transparently to clients. We did need to make strong as- 
sumptions on failure detection, but they can be satisfied 
by existing commercial software. 

We found the overhead of using the GT3 implementation
of the OGSI notification to be quite high. The overhead
is particularly large in the cases where the state
data is small or the number of clients is large. Much
of the overhead seems to come from the cost of notifications,
which can most likely be improved in future
implementations of GT3. Failing that, one might wish
to provide state update below the OGSI level or by using
simpler OGSI facilities such as basic Grid service
method calls. It might be possible to improve performance
of primary-backup by using an alternate protocol
binding (something that is specified in OGSI but
not available in GT3), but we have not explored this
option in any detail.

Our approach for primary-backup is only appli- 
cable for replicas located in a cluster, since otherwise 
failure detection becomes too unreliable for primary- 
backup. We are currently looking at methods that still 
accommodate nondeterminism, like primary-backup, 
but that can work in a wide-area network where asynchronism
is more of an issue.

Executive Summary

Sun Grid Engine (SGE), SGE Enterprise Edition (SGEEE), and Platform Comput- 
ing's Load Sharing Facility (LSF) are three distributed resource management soft- 
ware solutions. The source codes for SGE and SGEEE are available in the Grid 
Engine open source project sponsored by Sun Microsystems and hosted by CollabNet.
The Grid Engine open source project was launched in June 2001. Reference
releases are available via downloads from the Grid Engine open source project
(www.gridengine.sunsource.net).

The downloads are 100% compatible with the Sun products, SGE and SGEEE, 
apart from a different binary code license. SGE is free and available only through 
downloads from www.sun.com/gridware. SGEEE is available from Sun and its 
resellers only on CD at a cost that is dependent on grid size. 

SGE and LSF provide comparable functionality and are suitable for what Sun refers
to as cluster grids (one-project, one-department grids). SGEEE provides distributed
resource management in more sophisticated grids, i.e., enterprise grids
(multiple projects or departments, one organization), and provides the basic
functionality for global grids (i.e., collections of enterprise grids that cross
organization boundaries). SGE also provides basic functionality for global grids, but not
as much as SGEEE.

Moving from single departmental grids to more sophisticated grids requires hu- 
man cooperation. SGEEE implements the concept of policies for specifying how 
humans will cooperate in multidepartment, multiproject grids. That is, policies 
define how computer resources are distributed among projects and people. For 
example, some users may not want to share their CPUs (central processing units) 
on certain days of the week or with specific groups. As a result, policies are implemented
at a level above the distributed resource management layer.

SGEEE's policy module provides the following four benefits, which neither SGE 
nor LSF provide: 

1. It introduces new utility computing-like parameters for scheduling. 
Scheduling jobs is not based on priority alone. With SGEEE, a user, 
team, department, or project can receive a resource allocation, for a pe- 
riod of time, based on some percent of the total resources available. 
SGEEE will ensure that the assigned percentage of resources is available 
to the jobs within that project or for a user, team, or department. 

2. It enables collaboration. Users and project teams can negotiate resource 
assignments that can vary from week to week. For example, a project 
team may get 10% of the resources this week and 30% of the resources 
next week. This type of negotiation allows project teams to better manage
the start and completion of their projects and allows a project team
with a "hot" project to negotiate sufficient resources to ensure that the
project can finish on time.

3. It offers management alignment of compute power (cumulative usage
over time) to specific projects that have high importance.

4. The more you pay the more you get. Cumulative resource usage can be
proportional to budget contributions.

Sun and Platform Computing have taken different approaches to developing grid
computing solutions. Sun provides solutions with the necessary functionality
integrated within a single product such as SGE/EE, whereas Platform's
approach is to provide a base product (LSF Base combined with LSF Batch) and
then surround it with other products to complement the functionality in the base
offering. Even with the purchase of additional Platform products such as LSF
MultiCluster and LSF Parallel (beyond LSF Base and LSF Batch), LSF cannot
provide the functionality available in SGEEE.


• SGE and LSF provide comparable functionality and provide basic resource
management for cluster grids.

• SGEEE is the most comprehensive, cost-effective grid computing solution
currently on the market, one that is capable of providing resource
management for enterprise grids and aspects of global grids.

• Sun's most serious competitors in grid computing, HP (Hewlett-Packard)
and IBM, are relying heavily on third-party proprietary products
from companies such as Platform Computing versus building their
own. It is Aberdeen's perspective that Sun, by leveraging the contributions
of the Grid Engine open source project, is able to provide superior
functionality at a lower cost.

• While Sun is ahead of its competition, building its own grid products, it
is incorporating concepts developed by the leading research organizations
(of which Sun is a primary participant) in grid computing: the
Globus Project (www.globus.org), the Global Grid Forum [...], and the
Distributed Resource Management Application API ([...] Veridian).

What Is Grid Computing? 

[...] sharing of data files and [...] means of interaction across departments within an
organization [...]. These requirements have led to new ways of performing
application development and deployment. The requirements have been,
for the past few years, primarily the concern of developers of distributed systems
for scientific and technical research. Work within this community has led to the
formation of grid computing technologies.

© 2002 Aberdeen Group, Inc.
One Boston Place
Boston, Massachusetts 02108
Telephone: 617 723 7890
Fax: 617 723 7897
www.aberdeen.com

Sun's Grid Computing Solutions Outdistance the Competition

In general, the specific problem that grid computing tries to solve is coordinated 
resource sharing in dynamic, multi-institutional virtual organizations (VOs). Examples
of VOs include service providers, manufacturers, and organizations involved
in collaborative problem solving. This description of the problem that grid
computing tries to solve aligns perfectly with Sun's notion of global grids, which
span organizations. However, cluster grids and enterprise grids are also real examples
of grid computing. At Sun, grid computing is defined as the pooling of resources
into virtual systems. Sun introduces scaling with cluster grids, enterprise
grids, and global grids.

Sharing is concerned with direct access to computers, software, data, and other 
resources, i.e., the type of sharing required for collaborative problem solving. 
Sharing is highly controlled, with resource providers and consumers defining what 
is shared, who is allowed to share, and the conditions under which sharing occurs. 
Sharing relationships must be flexible for sophisticated and precise levels of control
of how shared resources are used. On the other hand, the relationships must
be definable so that they can be implemented and enforced. 

Resource sharing policies are conditional statements. That is, a resource owner 
makes resources available subject to constraints on when they can be used, how
they can be used, and for what they can be used. The implementation of con- 
straints requires mechanisms for expressing policies, for establishing the identity 
of a consumer or resource, and for determining whether or not the use of a re- 
source is consistent with the specified sharing relationships. Any policy mecha- 
nism must be capable of handling relationships that can vary dynamically over 
time, in terms of the resources involved, the nature of the accesses permitted, and 
the participants to whom access is permitted. In addition, a new participant (individual,
group, or organization) must be able to "discover" the nature of relationships
that exist at any given time. For instance, a new participant must be able to
determine what resources are available for it to access, the quality of the resources,
and the policies that govern access. A single resource may be shared in several
ways, possibly in different ways for each participant. 
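
The idea that a sharing policy is a conditional statement over when, by whom, and for what a resource may be used can be sketched as a simple predicate. Everything in this example is hypothetical: the `SharingPolicy` class, its fields, and the sample groups and days are illustrations and are not taken from any product described in this report.

```python
from dataclasses import dataclass, field


@dataclass
class SharingPolicy:
    """One owner's conditions on a shared resource (all fields hypothetical)."""
    allowed_groups: set = field(default_factory=set)
    allowed_days: set = field(default_factory=set)      # e.g. {"Sat", "Sun"}
    allowed_purposes: set = field(default_factory=set)  # e.g. {"simulation"}

    def permits(self, group, day, purpose):
        # Access is granted only if every condition holds.
        return (group in self.allowed_groups
                and day in self.allowed_days
                and purpose in self.allowed_purposes)


policy = SharingPolicy(
    allowed_groups={"design", "research"},
    allowed_days={"Sat", "Sun"},
    allowed_purposes={"simulation"},
)

weekend_ok = policy.permits("research", "Sat", "simulation")
weekday_ok = policy.permits("research", "Tue", "simulation")
```

A real policy mechanism would also need to establish the identity of the requester and allow the conditions to change over time, as the paragraph above notes.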

Today, distributed computing technologies do not address many of the concerns 
listed for sharing resources. The Internet, for example, addresses information interchange
among computers, but it does not address the types of flexible policies
required for sophisticated resource sharing at various computer sites by individu- 
als, groups, and organizations. That is where grid computing becomes important. 
In the past few years, researchers within the grid community have produced protocols,
services, and tools that address the challenges of resource sharing. These
technologies include security solutions, resource management protocols and
policies that support secure remote access to computing and data resources, etc.
Members of the Globus Project (www.globus.org) have published an Open Grid
Services Architecture (OGSA) proposal. The OGSA proposal outlines interfaces to
grid computing software that comply with Web services standards. The objective is
to take advantage of Web services properties such as service description and [...]
scheduling, [...] Web services architecture. Web services are a natural fit with the
[...] applications across [...] heterogeneous networks using Web-related standards
like Web Services [...]

While grid computing originated in the research world and few, if any, commercial
grids are in use today, it is Aberdeen's perspective that grid computing will [...]
commercial applications including e-Business
applications. The jump to the commercial world is not difficult because groups
work in teams and share resources just like scientific researchers do, and they
want the same three things:

1. Efficient use of hardware and workload balancing;

2. Quality of service - which nodes are working, which nodes are not
working, route around overloaded nodes, etc.; and

3. Flexibility — virtual access to resources. 

Grids are a potential solution to an organization's lack of compute power and to a
more efficient use of existing resources. In many companies, desktop utilization is
relatively low, on the order of 20% or less, and there is often a duplication of resources
across groups. Grids, via distributed resource management software, aggregate
available compute resources and deliver compute power as a network service.
Grids increase productivity by matching workload and resources. And perhaps
more importantly, grids make it as easy to use many CPUs as to use one, allowing
the user to be much more productive.

Software Requirements for Grid Computing 

Grid computing requires a collection of software features that contribute to managing
the resources in heterogeneous, distributed computing environments.
Sophisticated grid computing solutions must address:

• Coordinated resource management - managing the use of shared resources
to best achieve an enterprise's goals such as productivity, timeliness,
level of service, etc.

• Dynamic policy mechanisms - mechanisms capable of handling relationships
that can vary dynamically over time, in terms of the resources
involved.

• Discovery - participants must be able to determine what resources are
available for access, the quality of resources, etc.

• Dynamic scheduling - the capability to remove resources from a low
priority job and give them to a higher priority job so the higher priority
job can complete execution. This is done without killing the lower priority
job. If no other higher priority jobs are remaining, then the resources
can be given to lower priority jobs to continue execution.

• Controlled sharing of resources - enforce policies that constrain access
according to group membership, ability to pay, etc.

• High-level policy administration - distribution of computer resources
among projects and people versus the lower level concept of distributing
resources among jobs.

• Security - security services define standard functions for identifying
individuals in communicating parties, encrypting messages, and so forth.

• Heterogeneous, distributed computing environments - grids that span
projects and organizations almost always utilize platforms from several
suppliers.

• Checkpointing capability - ability to move jobs from host to host during
execution without restarting.

• Open standards - standards-based solutions facilitate extensibility,
interoperability, portability, and code sharing.

• Adjust to various customer attitudes toward grid computing -
architectures should be flexible enough to satisfy distributed applications
and satisfy user application requirements versus the grid architecture's
forcing a structure on the applications.

• Differences with respect to high-performance computing (HPC) - grid
computing and high-performance computing are intertwined, but they
are not the same. Some HPC applications may suffer using some grid
computing approaches, e.g., bandwidth- and latency-dependent applications
deployed across multicluster grids or grids spanning distances.

Market Potential for Grids

Grids have the potential to impact the marketplace in a number of ways by enabling
more complex simulations and modeling efforts, expanding and sharing access
to databases, and speeding communication between collaborators working on
common projects.

Today, grids are in the very early adopter stage with the primary use coming from
[...] such as Electronic Design Automation
(EDA), life sciences, and others. The primary drivers are resource optimization,
access to resources, cost sharing, and improved models for managing resources.

Sun's SGE and SGEEE

Sun has two grid computing solutions - SGE 5.3 and SGEEE 5.3. SGE is suitable
for cluster grids, and SGEEE is capable of handling enterprise grids and some aspects
of global grids. When Sun announced SGE in September 2000, the product
already had a five-year history. Sun acquired Gridware, developer of the [...]
product, in July 2000.

The front-end development for both SGE and SGEEE is done in the Grid Engine
open source project (www.gridengine.sunsource.net). Sun does not deviate from
the source code produced via the Grid Engine project for releases of SGE/EE. Reference
releases, which are functionally identical to SGE and SGEEE at a point in
time, are available via the Grid Engine project. SGE and SGEEE are both made
from the same source tree in the Grid Engine project and share internal components.
When Sun releases a new version of SGE and SGEEE, it brings a
stable build of Grid Engine software into the Sun quality assurance process and
documents and productizes the software under the SGE/EE brands.

For customers who download SGE from www.sun.com/gridware and who buy
SGEEE, Sun delivers controlled upgrades via patches (through support services
contracts). Sun does not provide support contracts on the binaries available in the
Grid Engine open source project. But most enterprise customers want a supported
version and are willing to pay for support. Anyone, including Sun's competitors,
[...] to make their own
version of Grid Engine.

Product Comparisons 

Functionality and feature comparisons for SGE, SGEEE, and LSF are presented in 
the following sections, and Appendix A contains detailed overviews of the three
grid product offerings.

Comparison of SGE 5.3 and LSF 4.2

Both SGE and LSF provide functionality for what Aberdeen considers to be basic
resource management; both solutions are suitable for supporting cluster grids,


but not enterprise or global grids. One of the major differences between SGE and
LSF is that SGE is free and LSF Base plus LSF Batch costs $995 per CPU for Unix 
and $399 per CPU for Linux. Support for SGE is available from Sun via the Web 
and via a software-only support contract with Sun Enterprise Services. Support is 
also available from Sun partners and resellers. 

Another difference between SGE and LSF is that SGE is a solution with all the func- 
tionality required to provide resource management for cluster grids integrated 
within a single product offering. Platform Computing's approach is to provide a 
base product (LSF Base with LSF Batch) and surround it with other offerings
(LSF MultiCluster, LSF Parallel, LSF JobScheduler, LSF Make, and LSF Analyzer) to
complement the functionality in the base offering.

LSF is built in layers. For this reason, LSF is composed of several individual products.
The base system (LSF Base) provides a basic level of services that allow users
to perform dynamic load sharing and distributed processing. However, if sophisti- 
cated job scheduling and resource allocation policies are necessary — as they fre- 
quently are in grid computing — more complex scheduling must be built on top 
of LSF Base using LSF Batch. (LSF Batch is a batch system built on top of LSF Base 
to provide distributed batch job scheduling services.) That is the reason LSF Base 
and LSF Batch are both necessary to provide a base product for grid computing. 
The disadvantage of Platform's approach is that the user has to pay additional 
costs for each of the separate offerings. A description of the additional LSF prod- 
ucts can be found in Appendix A. 

SGE and LSF both provide the following functionality: 

• Batch processing - submit jobs and process them as soon as the resources
are available;

• Dynamic allocation of resources - allocate resources as they are required
and release them when not needed;

• Fault tolerance - automatically resubmit jobs that fail;

• Failover capability - grid continues to operate if one or more hosts
fail;

• User-specifiable resources - at submission time, a user can specify resources
needed to complete a job;

• Resource location independence - the user does not know or care
where the compute resources are located in the grid;

• Job status - users want to know what is happening to their job at any
given time;

• Host status - system administrators need to know the utilization and
up/down status of all hosts in a grid;

• Centralized management - system administrators need the capability
to manage an entire grid from one application, which is not the same as
having a single machine from which to manage a grid; and

• Suspend/resume jobs - the user (and system administrator) has the capability
to halt a job and restart it later without losing the work already
completed.

Table 1 provides a comparison matrix of important features for SGE 5.3 and LSF 4.2. 
The concept of queues in SGE and LSF requires some explanation because they
affect scheduling in the two product offerings. When a job enters an LSF queue, it
must remain in the queue until its execution is completed, unless a system administrator
manually moves it to another queue. LSF queues are distinct from hosts.
All LSF jobs in a high-priority queue run before any jobs in a lower priority queue
are started.

SGE queues are an expression of what resources are available on each machine in 
a grid. When a job is submitted to an SGE-based system, the SGE scheduler takes 
into account the order in which the job was submitted, what queues (hosts) are 
available, and the priority of the job. The scheduler places all jobs in a single
pending list and continuously re-evaluates the availability of resources and priority
of all jobs in the grid. If a host has the resources available, SGE will automatically
move the highest priority job to a queue on that host to begin execution immediately.
That is an advantage for SGE (and SGEEE) with respect to utilization of resources
and throughput.

An analogy for LSF queues is a grocery store where customers select a queue and
wait in line to be checked out. An analogy for SGE queues is a hospital emergency
room where selection of those served next is made on a number of parameters,
with the list frequently re-sorted.
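
The contrast between the two queueing models can be sketched in a few lines of Python. This is an illustrative model only, not code from either product: the `Job` fields, the convention that a larger priority number is more urgent, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int      # assumption: larger value = more urgent
    submit_order: int  # position in which the job was submitted


def lsf_style_order(queues):
    """queues: dict mapping queue priority -> list of jobs in arrival order.

    Every job in a higher-priority queue starts before any job in a
    lower-priority queue; inside a queue, jobs keep their arrival order.
    """
    ordered = []
    for queue_priority in sorted(queues, reverse=True):
        ordered.extend(queues[queue_priority])
    return [job.name for job in ordered]


def sge_style_order(pending):
    """pending: one list holding all waiting jobs in the grid.

    The list is re-sorted on every scheduling pass by job priority,
    with submission order as the tie-breaker.
    """
    ranked = sorted(pending, key=lambda j: (-j.priority, j.submit_order))
    return [job.name for job in ranked]
```

In the grocery-store model a low-priority job that happens to sit in a high-priority queue still runs first; in the emergency-room model only the job's own priority and submission order matter.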

SGE Versus SGEEE 

SGEEE contains all of the functionality and features that SGE has plus the important
concept of policies that permit it to adapt to general grid computing models
such as enterprise and global grids. SGEEE is substantially different from SGE (and
LSF or other similar competitive products). The primary difference is SGEEE's
policy module.

Any SGE grid can be upgraded to SGEEE by upgrading the master host in the grid.
SGEEE requires that the master daemon run on a Solaris (2.6 to 9.0)/SPARC master
host, whereas the master daemon in an SGE grid can run on Solaris (2.6 to
9.0)/SPARC, Solaris (2.6 to 8.0)/x86, Linux/SPARC, or Linux/Intel. Master daemons
in both SGE and SGEEE are 100% compatible with execution daemons from Grid
Engine open source project builds such as AIX, HP-UX, IRIX, Linux, and others.

SGEEE Versus LSF

Aberdeen does not view LSF as a competitor to SGEEE because LSF does not provide
the policies required to manage resources in enterprise grids. LSF is priced at
$995 per CPU for Unix. For more than 100 licenses, SGEEE is priced at approximately
$300 per CPU. Not only is SGEEE much less expensive than LSF, but it also
contains significant functionality that LSF does not provide.

SGEEE Versus Aberdeen's Software Requirements for Grid Computing

When the functionality and features that SGEEE provides are compared with the list
of software requirements that Aberdeen considers important for grid computing, 
SGEEE fares very well. For instance, it is Aberdeen's perspective that SGEEE is 
more than adequate for controlling and resolving resource sharing in dynamic VOs 
where collaborative problem solving is a must. SGEEE's policies are capable of de- 
fining what is shared, who can share, etc. And, importantly, policies defined by the 
use of SGEEE's policy mechanism can be implemented and controlled. 

SGEEE provides for dynamic scheduling and controlled sharing of resources. It 
thrives in heterogeneous, distributed computing environments; is based on open 
standards; provides an adequate degree of security across departments and or- 
ganizations; and is flexible enough to allow jobs to be automatically re-directed to 
new hosts when resources become available. 



Table 1: Comparison of Important Features — SGE Versus LSF

| Features | SGE 5.3 | LSF 4.2 |
|---|---|---|
| Cost | Free | $995 per CPU for LSF Base and LSF Batch for Unix; $399 per CPU for Linux |
| Dynamic scheduling | Yes | No |
| Transparent to application | Transparent to application if compatible with the underlying operating system | Transparent to application if compatible with the underlying operating system |
| Scripts for job submission | Scripts specific to the environment are required | Scripts specific to the environment are required |
| Ease of installation | Very easy to install, takes one to two minutes per host | Takes about a half-day to create a grid of 50 to 100 hosts |
| Ease-of-use | Easy to use (based on information collected at download time) | More difficult to use than SGE (based on download information) |
| Scalability | There are separate daemons for the master and scheduling functions that can run on separate CPUs in a dual processor host, allowing more jobs to be processed simultaneously | Places master and scheduler functions in one daemon, creating potential delays in scheduling |
| Enterprise grid-like capability | Not available in SGE, but available in SGEEE | Limited form using LSF MultiCluster at an additional cost of [amount illegible in source] to $300 per node |
| Specify queue name at job submission | Not applicable | Name for a specific queue can be submitted with the job |
| Array memory for submitting large numbers of jobs at the same time | Yes | Yes |
| Globus support | Supported under SGE/EE | Yes |
| Parallel job execution | Yes | With LSF Parallel at an additional cost |
| Heterogeneous computing environments | Supports AIX, HP-UX, IRIX, Solaris, Tru64 UNIX, Linux | Supports AIX, HP-UX, Solaris, Tru64 UNIX, Linux, Mac OS X, Windows |
| Master host configurations | Each grid requires a master host. SGE master daemon must run on Solaris/SPARC, Solaris/x86, Linux/SPARC, or Linux/Intel. SGEEE master daemon must run on Solaris/SPARC | Each grid requires a master host. No platform restrictions on master host |
| Job checkpointing | Yes | [entry missing in source] |
| GUI support | Yes | Yes |
| Command line interface | Yes | Yes |
| Error logging | Yes | Yes |
| Required resources can be requested at job submit time | Yes | Yes |
| Job accounting, e.g., submit, execution times | Yes | Yes |
| Job arrays | Yes | Yes |
| SNMP agent support | No | Yes |
| Adding/removing hosts without shutdown | Yes | Yes |


Source: Aberdeen Group, May 2002 

Aberdeen Conclusions 

While grid computing originated within the scientific and technical market segments,
it is just as appropriate for commercial applications and any type of computing
where sharing of files and databases as well as collaborative forms of interaction
across projects, departments, and organizations are important. With SGE
5.3 and SGEEE 5.3, Sun is able to provide the functionality required for today's
grids. As grid computing moves more and more to the forefront, Sun, with
SGEEE 5.3 and its concept of policies, is positioned to be the leading supplier of
grid computing products for the future.

Sun's most serious competitors in the scientific and technical market segments
where grid computing originated, HP and IBM, are trailing Sun in the development
of grid product offerings. While Sun has jumped out front in the grid
market with its own products (SGE 5.3 and SGEEE 5.3), the competition is relying
on third-party offerings such as Platform Computing's LSF. Aberdeen concludes
that LSF 4.2 falls far short of SGEEE 5.3 in the functionality required to create
sophisticated grids, and LSF 4.2 costs significantly more per CPU than does
SGEEE 5.3.

Sun is positioned to respond to new grid computing requirements through its cus- 
tomer base and through its involvement with the open source Globus project 
(www.globus.org) and the GGF. Sun was the first sponsor of GGF in 1999. Sun is 
convinced that standards and open source are fundamental for the success of grid
computing, its suppliers, and its users. Sun was the first systems supplier to place
key grid technology — the Grid Engine product suite — into open source. The 
Globus-based OGSA architecture is based on the same Web services standards that 
Sun ONE (Open Net Environment) Web Services are based on. Sun is providing 
input as an active member of the OGSA working group based on its experience 
with the large Sun ONE installed base. 

Sun also initiated the DRMAA working group within the GGF. This effort promises 
to give commercial application vendors a "write once" interface to utilize whatever 
DRMAA-compliant resource management system the customer has deployed, thus 
reducing deployment effort for end-users. A number of manufacturers of resource 
management systems are working to complete this API in 2002. 

APPENDIX A

Product Descriptions for SGE 5.3, SGEEE 5.3,
and LSF 4.2

SGE 5.3 

Sun Microsystems, Inc. 
4150 Network Circle 
Santa Clara, CA 95054 
www.sun.com
(650) 960-1300 

SGE 5.3 

An administrator is a user who is allowed to fully control SGE/EE. An operator
is a user with administration privileges who is not allowed to change queue
configurations. An owner is a user who is allowed to [illegible in source].
A user can [illegible in source]

SGE Overview

[Opening of paragraph illegible in source.] SGE can be used with any type of
server, dedicated or shared compute [illegible in source], and desktop systems.
SGE is suitable for developing cluster grids.

The basic function of SGE is to match available resources in a grid with users'
requests. SGE supports both batch jobs and interactive jobs. A batch job is a shell
[illegible in source] without user intervention, and it does not require a
[illegible in source] SGE commands that [illegible in source]

Computational resources are delivered to user jobs by SGE based on resource
[illegible in source] projects in SGEEE [illegible in source] the organization's
technical and management staff. SGE uses the policies to examine available
computational resources [illegible in source] allocates them to jobs in a manner
that optimizes their usage across the cluster grid.



© 2002 Aberdeen Group, Inc.
One Boston Place
Boston, Massachusetts 02108
Telephone: 617 723 7890
Fax: 617 723 7897
www.aberdeen.com

Sun's Grid Computing Solutions Outdistance the Competition




SGE Job Flow 

Users can submit batch jobs, interactive jobs, and parallel jobs to SGE. SGE sup- 
ports checkpointing jobs — jobs that can migrate from host to host within a grid 
without user intervention and based on load demand on the SGE system. Check- 
pointing is a procedure that saves the execution status of a job into a checkpoint 
area for the job, permitting the job to be aborted and restarted later without loss 
of information and already completed work. 

At a high level, SGE works in the following manner (Figure 1): 

• Accepts jobs from users; 

• Places jobs in a computer holding area until they can be executed; 

• Sends them from the holding area to a host where they can be executed; 

• Manages them during execution; and 

• Logs a record of their execution when they are finished. 
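
The five steps above can be sketched as a toy scheduler. This is a minimal model for illustration, not SGE's implementation: the class name `MiniScheduler`, its method names, and the first-in, first-out holding area are assumptions.

```python
from collections import deque


class MiniScheduler:
    """Toy model of the five job-flow steps; not SGE's actual design."""

    def __init__(self, hosts):
        self.free_hosts = deque(hosts)  # hosts currently able to run a job
        self.pending = deque()          # holding area for accepted jobs
        self.running = {}               # job name -> host executing it
        self.log = []                   # record of finished jobs

    def submit(self, job):
        # Steps 1 and 2: accept the job and place it in the holding area.
        self.pending.append(job)

    def dispatch(self):
        # Step 3: send held jobs to hosts for as long as hosts are free.
        while self.pending and self.free_hosts:
            job = self.pending.popleft()
            self.running[job] = self.free_hosts.popleft()

    def finish(self, job):
        # Steps 4 and 5: stop managing the job, free its host, log it.
        host = self.running.pop(job)
        self.free_hosts.append(host)
        self.log.append((job, host))
```

With a single host, a second submitted job stays in the holding area until the first one finishes and the next `dispatch()` pass runs.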

A user who submits a job to SGE specifies a requirement profile for the job along 
with user identification and a priority number. The requirements profile is a 

Figure 1: Job flow for SGE (Left) and SGEEE (Right) 




Source: Sun Microsystems, May 2002 
statement of attributes associated with the job such as memory requirements,
operating system required, etc. SGE schedules the most important jobs first. Based
on the contents of the requirement profile, SGE dispatches the job to a suitable
queue associated with an appropriate host server on which the job will be executed.
SGE uses load balancing to spread workload among available servers.

SGE Components 

SGE is composed of several components: hosts, daemons, queues, and client
commands. An SGE grid cluster is composed of four types of host nodes: master,
execution, administration, and submit. The master host controls all SGE activity.
It runs the master daemon and the scheduler daemon. Implementing the
master and scheduling functions in two separate daemons allows these two functions
to run on different CPUs in the same host, thereby improving scalability.
These two daemons control all queues and jobs and maintain tables about the
status of queues and jobs, about user access permissions, etc. By default, the master
host is also an administration host and submit host. Execution hosts are nodes
that have permission to execute (and submit) SGE jobs, and they have queues
associated with them.

An administrative host is a node from which administration commands may be issued.
It is responsible for carrying out administrative activity for an SGE cluster
grid. Submit hosts are nodes that are permitted to submit batch jobs and query 
their status. A node in an SGE cluster grid can belong to multiple host classes 
simultaneously. 

The functionality of the SGE system is performed by four daemons. The master
daemon maintains tables about hosts, queues, jobs, system load, and user permissions.
The master daemon must run on a Linux- or Solaris-based host. The
scheduler daemon maintains an up-to-date view of a grid's status. It determines
which jobs are dispatched to which queues and then forwards its decisions to the
master daemon, which initiates the required actions. The execution daemon is
responsible for the queues associated with the host on which it runs, and it
periodically forwards the status of jobs and the load on its host to the master daemon.
The communication daemon communicates over a well-known TCP (Transmission
Control Protocol) port. It is used for all communication among SGE components.

An SGE queue is a container for a class of jobs allowed to execute concurrently on
a particular host. Throughout their lifetimes, running jobs are bound to a queue.
SGE users do not submit jobs to queues. Users specify the requirement profile of
a job, and SGE software dispatches the job to a suitable queue on the host with the
lowest workload.

The command line user interface is a set of commands that allows
users to manage queues, submit and delete jobs, check job status, and
suspend/enable queues and jobs.
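
The dispatch rule described above (match the requirement profile, then prefer the host with the lowest workload) can be sketched as follows. The attribute names `os`, `mem_mb`, and `load`, and the function `pick_queue`, are hypothetical and chosen only for this example; a real requirement profile would carry more attributes.

```python
def pick_queue(profile, queues):
    """Return the name of the suitable queue whose host has the lowest load.

    profile: dict of required attributes, e.g. {"os": "solaris", "mem_mb": 512}
    queues:  list of dicts with keys "name", "os", "mem_mb", "load"
    Returns None when no queue satisfies the profile.
    """
    suitable = [
        q for q in queues
        if q["os"] == profile["os"] and q["mem_mb"] >= profile["mem_mb"]
    ]
    if not suitable:
        return None
    # Among the queues that satisfy the profile, prefer the least-loaded host.
    return min(suitable, key=lambda q: q["load"])["name"]
```

Because the user supplies only the profile, the queue name never appears in the request; the scheduler chooses it.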

SGEEE 5.3

Sun Microsystems, Inc. 
4150 Network Circle 
Santa Clara, CA 95054 
www.sun.com
(650) 960-1300 

SGEEE 5.3 

The primary difference between SGE and SGEEE is that SGEEE can consolidate 
multiple cluster grids into enterprise grids. This capability is provided in SGEEE 
with the incorporation of policies. These SGEEE policies are a level above job re- 
source allocation. Policies dictate how compute resources are distributed among 
projects and people, not jobs. 

The administrator of an SGEEE grid defines customized policies corresponding to 
what is appropriate for the use of the grid. Policies are needed to provide the 
flexibility required for managing resources across multiple projects and across 
multiple organizations. Project owners using an enterprise grid have varying requirements
with respect to how they manage their projects. They need capabilities
to negotiate policies, flexibility for manual overrides for unique project requirements,
and automatic monitoring and enforcement of policies.

Policies/Tickets 

Within SGEEE, tickets are used to distribute the workload. They are assigned to
projects, users, and jobs. More tickets mean higher priority and faster execution. 
The following are four policies available in SGEEE: 

1. Share-tree based: This policy allocates a percentage of total compute re- 
sources to each user or project. If actual cumulative usage allocated 
over a period of time to each user or project exceeds the assigned value 
of resource allocation, then "borrowed" resources are "returned" to the 
other users. For example, if project A in a queue is not using all of its 
resources, then project B can gain access to them and use as much of A's 
resources as A allows. When A resumes normal activity, B must return to 
A a fraction of the borrowed resources. The longer B holds resources 
borrowed from A, the fewer resources B has to return to A. That is, if B 
holds some of A's resources for three weeks, then B has to return less 
than if B holds them for only two days. 

2. Functional: The functional policy is similar to the share-tree-based pol- 
icy, but B does not have to return resources "borrowed" from A. This 
policy forgets about the past. 

3. Deadline: The deadline policy is invoked whenever a job
must be finished before or at a certain point in time and may require
special treatment to achieve this. An administrator can manually give
jobs extra resources to meet deadlines. The resources are relinquished
after the deadline. The deadline policy is implemented by redistributing
tickets: an administrator can assign a job more tickets to raise its priority
so that it can meet its execution deadline.

4. Override: The override policy allows administrators to make resource
allocation decisions manually rather than leaving them to SGEEE. The
override policy is implemented by giving additional tickets to jobs, users,
or projects to temporarily adjust their relative importance.

SGEEE software uses these policies to examine the available computational resources
within the enterprise grid. Then it gathers, allocates, and delivers these
resources automatically so that highly optimized resource usage is achieved.

Figure 1 illustrates job flow for SGEEE. The job flows for SGE and SGEEE are similar
in some respects; however, the availability of policies for resource management
at the people and projects level imposes another level of control on top of resource
management for jobs.
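
The statement that more tickets mean higher priority can be illustrated with a small sketch. It assumes, for illustration only, that the tickets a job receives under the four policies are simply added together; the function names and the sample ticket counts are hypothetical and are not taken from SGEEE.

```python
def total_tickets(share_tree=0, functional=0, deadline=0, override=0):
    """Sum the tickets a job holds under each policy (assumed additive)."""
    return share_tree + functional + deadline + override


def rank_jobs(jobs):
    """jobs: dict mapping job name -> dict of per-policy ticket counts.

    Returns job names ordered from most tickets to fewest.
    """
    return sorted(
        jobs,
        key=lambda name: total_tickets(**jobs[name]),
        reverse=True,
    )
```

In this model a deadline or override grant is just an extra term in the sum, which is why an administrator can raise one job above others without changing the share-tree or functional settings.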

LSF 4.2

Platform Computing, Inc.
3760 14th Avenue

LSF 4.2

LSF is designed to be a layer on top of existing operating systems. LSF runs on 
most Unix systems, Linux, Windows 2000, and Cray machines. The base LSF 
product suite consists of LSF Base and LSF Batch. While LSF Base is a stand-alone 
offering, it must have LSF Batch to provide the sophisticated job scheduling and 
resource allocation necessary for grids. LSF Base provides the load sharing and 
distributed processing for the LSF solution suite. It provides services such as host 
selection, resource information, and transparent remote execution. LSF-based 
grids can span departments within an organization. AMD has a 1500-node grid 
that was created using LSF Base and LSF Batch. 

LSF Batch is a distributed batch system built on top of LSF Base to provide batch 
job scheduling services to users. LSF Batch accepts user jobs and holds them in 
queues until suitable hosts are available. Host selection is a function of up-to-date 
load information stored in the load information manager (LIM). LSF Batch holds 
submitted jobs in a job file until conditions are right for them to be executed.

In addition to LSF Base and LSF Batch, LSF companion products include the 
following: 

• LSF Analyzer — provides workload analysis across a cluster of computing
resources. It generates charge-back accounting reports, removes
bottlenecks, and helps tune overall system performance.

• LSF MultiCluster — supports resource sharing among multiple LSF
grids.

• LSF Make — dispatches tasks to multiple hosts to reduce job-processing 
time. 

• LSF Parallel — manages parallel job execution. 

• LSF JobScheduler — provides fault-tolerant, scalable, calendar-driven 
and event-driven scheduling across server grids. 

LSF MultiCluster provides for workload sharing across grids by linking queues in 
distinct grids together. Each companion product has a price tag associated with it. 

LSF projects a network of heterogeneous computers as a single system. LSF, like 
SGE and SGEEE, does not require any alterations to applications. With LSF, jobs 
run remotely and behave like jobs run on a local machine. Batch jobs can be run
automatically as resources become available. Jobs can be suspended and resumed
based on resource availability. In addition, LSF can run sequential and parallel
applications and either interactive or batch jobs.

Using LSF, administrators can control access to resources such as the following: 

• Who can submit jobs and which hosts they can use; 

• How many jobs specific users or user groups can run simultaneously; 

• Time windows during which each host can run load-shared jobs; 

• Load conditions under which specific hosts can accept jobs or suspend 
jobs; and 

• Resource limits for jobs submitted to specific queues. 

LSF supports job checkpointing, permitting job migration to a better host. It supports
parallel virtual machine (PVM) and message passing interface (MPI). Several
scheduling policies (not to be confused with SGEEE policies) are available for
managing batch jobs. These policies include preemptive; preemptable; exclusive;
first-come, first-served; and fairshare. Resources can be reserved when a job is
submitted, guaranteeing that the job will always have the resources it needs while
executing. Alternatively, resources can be acquired during job execution, possibly
delaying a job's progress because it may have to wait until resources that it needs
become available.



Job Flow in LSF 

In LSF, a job must be submitted to a queue. A user can name the queue when a 
job is submitted. When a job is submitted without a queue name, LSF examines 
the requirements of the job and automatically chooses a queue from a list of de- 
fault queues. Automatic queue selection is based on user access restrictions (a 
user may not be permitted to submit jobs to some queues), host restriction (queue 
must be configured to send the job to all hosts in its specified list), etc. 

Queues are not tied to hosts; instead, LSF provides a network-wide view of
queues. Jobs can be moved from queue to queue manually. Each time LSF attempts
to dispatch a job, it determines which hosts are eligible to run the job. A
suitable host is one that has an acceptable load level, meets the resource requirements
of the job, etc. Each queue has a priority number. Jobs from the highest
priority queue are started first. An LSF administrator sets queue priority when the
queue is defined. Jobs are dispatched from the highest priority queue first and then in
first-come, first-served order within a queue. An administrator can change the
order of jobs in a queue.

LSF is designed to be fault tolerant; that is, grid clusters continue to operate
even if one or more hosts in the grid are unavailable. Each LSF grid has a master
host that is chosen dynamically. If the current master host becomes unavailable,
then another host automatically takes over based on the order in which hosts are
listed in the cluster name file.
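
The failover rule described above amounts to picking the first reachable host in the configured order. A minimal sketch follows; the function name `elect_master` and the `is_available` callback are assumptions, and how LSF actually probes host health is not covered by the source.

```python
def elect_master(cluster_hosts, is_available):
    """Return the first available host in configured order, or None.

    cluster_hosts: host names in the order they appear in the cluster file
    is_available:  callable taking a host name and returning True or False
    """
    for host in cluster_hosts:
        if is_available(host):
            return host
    return None
```

Re-running the election whenever the current master stops responding reproduces the takeover behavior: the next host in the list becomes master.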

A job goes through a series of states during its execution phase. Most jobs enter 
only three states — pend (waiting in a queue for scheduling and dispatching), run 
(dispatched to a host and running), and done (job completed). A job remains 
pending until all conditions for its execution are met. Jobs can also be placed in a
suspended state by their owners, an LSF administrator, someone with root access, 
or by LSF itself. 
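
The job states can be modeled as a small state machine. The source names the states but does not list every permitted transition, so the table below is an assumption made for illustration (for example, that a suspended job may return to either pending or running).

```python
# Assumed transition table; the source names the states but not every edge.
TRANSITIONS = {
    "PEND": {"RUN", "SUSPENDED"},
    "RUN": {"DONE", "SUSPENDED"},
    "SUSPENDED": {"PEND", "RUN"},
    "DONE": set(),
}


def advance(state, new_state):
    """Return new_state if the move is allowed, otherwise raise ValueError."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Most jobs follow the path PEND, RUN, DONE; the SUSPENDED state is entered only when an owner, an administrator, or the system intervenes.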

Message



About Grid computing 



What would it mean if you could: 

• Analyze the value of an investment portfolio in minutes rather 
than hours? 

• Unite research teams with others around the world to take 
advantage of the most up-to-date learnings? 

• Significantly accelerate the drug discovery process? 

• Scale your business to meet cyclical demand?

• Cut the design time of your products in half while reducing the 
instances of defects? 

Government labs and scientific organisations have been using grid
technologies for several years, solving some of the most complex
and important problems facing mankind. Now grid computing is
becoming a critical component of day-to-day business. Today's
challenging business climate requires continuous innovation to
differentiate products and services. Businesses must adjust
dynamically and efficiently to marketplace shifts and customer
demands.

IBM's response to these customer needs is what e-business on
demand is all about. There's a profound shift afoot in how computing
is used, even in basic assumptions about how it's accessed and
paid for. Grid computing can bring tremendous productivity and
efficiency to organizations facing the challenges of an on demand
world.



IBM has practical information on grid computing 
Find out what a grid is. 
What is grid computing? 

Learn how IBM uses grid. 
IBM and grid 

Learn about the significant productivity and efficiency gains that 
grid can offer businesses today. 
Grid benefits 

Get answers to frequently asked questions for businesses just
considering grid computing, as well as those taking the next steps in
unleashing grid power.
Frequently asked questions

Abstract

The Accelerated Strategic Computing Initiative (ASCI)
computational grid is being constructed to interconnect
the high performance computing resources of the nuclear
weapons complex. The grid will simplify access to the
diverse computing, storage, network, and visualization
resources, and will enable the coordinated use of shared
resources regardless of location. To match existing
hardware platforms, required security services, and
current simulation practices, the Globus MetaComputing
Toolkit [1] was selected to provide core grid services. The
ASCI grid extends Globus functionality by operating as
an independent grid, incorporating Kerberos-based
security, interfacing to Sandia's Cplant™, and extending
job monitoring services. To fully meet ASCI's needs, the
architecture layers distributed work management and
criteria-driven resource selection services on top of
Globus. These services simplify the grid interface by
allowing users to simply request "run code X anywhere."
This paper describes the initial design and prototype of
the ASCI grid.



1. Architecture description 

The ASCI grid architecture provides the software
infrastructure for an integrated information and simulation
environment for the nuclear weapons complex in 2004.
The Distributed Resource Management (DRM) project
provides grid services to higher level applications. The
initial implementation has two development thrusts. First,
ASCI requires a core integrated environment for job
submission to classified ASCI platforms at the three
weapons laboratories: Los Alamos (LANL), Lawrence
Livermore (LLNL), and Sandia (SNL). Globus provides
these core services, with some extensions for specific
ASCI requirements. Second, a software layer above the
core services provides a set of common capabilities that



Sandia is a multiprogram laboratory operated by Sandia Corporation, a
Lockheed Martin Company, for the United States Department of Energy
support higher-level problem solving environments
(PSEs). These capabilities include complex work
management and resource brokering. This architecture
demonstrates the following features:

• The use of Globus services for accessing resources
in an independent ASCI environment

• Kerberos-based authentication, grid information
authorization, and access control

• A Globus interface to the two-tier architecture of
Sandia's Computational Plant (Cplant™), a
distributed computing resource built of commodity
[remainder of item missing in source]

• CORBA-based work management services to
support the coordinated use of multiple resources
and complex task sequencing

• CORBA-based criteria-driven resource brokering
that extends job request/resource matching to
support software resources and load-balancing

• Java-based client for generic work requests

• Web-based monitoring

• Integration of the grid infrastructure with CORBA,
Java, and Globus-based PSEs and tools.

The DRM grid services address three aspects of the
shift from platform-centric to network-centric computing:
software resources, workflow, and co-scheduling.

Software resources serve a large class of users and
problems where many resources are suitable, and the
primary interest is how soon the results can be obtained.
Some simulation codes are run by a number of users at
different sites, and must be available for many of the grid
computing resources. Multiple versions of these codes are
maintained to support different users and to ensure
reproducibility of results. Requests to "run code X" are
satisfied by software resources and brokering.
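
A "run code X" request can be sketched as a lookup over resources that advertise which codes and versions they host. This is a hypothetical illustration, not the DRM broker: the record layout (`name`, `load`, `codes`), the function name `broker`, and the choice of lowest load as the tie-breaking criterion are assumptions.

```python
def broker(code, resources, version=None):
    """Pick the least-loaded resource that offers the requested code.

    resources: list of dicts with "name", "load", and "codes", where
               "codes" maps a code name to the set of installed versions.
    version:   optional exact version, e.g. to reproduce an earlier result.
    Returns the resource name, or None if nothing matches.
    """
    candidates = [
        r for r in resources
        if code in r["codes"]
        and (version is None or version in r["codes"][code])
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r["load"])["name"]
```

Pinning `version` narrows the candidate set, which is how maintaining multiple versions supports reproducibility at the cost of fewer eligible resources.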

Workflow services manage complex task sequencing in 
the network environment, analogous to the complex 
scripts often submitted in a platform environment. 
Workflow can coordinate computational processing steps, 
computation and visualization, computation and data 
movement, or other resource usage. Subtasks are 
scheduled independently. 
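
Task sequencing of this kind reduces to ordering subtasks so that each runs after the steps it depends on. The sketch below uses Python's standard-library `graphlib` (Python 3.9 or later) rather than the CORBA services the paper describes; the task names and the function `run_order` are hypothetical.

```python
from graphlib import TopologicalSorter


def run_order(depends_on):
    """Return subtasks in an order that respects their dependencies.

    depends_on: dict mapping each task to the set of tasks it must follow.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

For example, a workflow in which visualization follows computation and computation follows data staging would be expressed as `{"visualize": {"compute"}, "compute": {"stage_data"}, "stage_data": set()}`; each subtask can then be handed to the scheduler independently once its predecessors have completed.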



Coscheduling of multiple resources will be needed to 
ensure that the critical resources in the ASCI grid achieve 
high utilization, to ensure that high priority jobs obtain 
resources as needed, and to enable new programming
methodologies such as coupled computation and 
visualization. 

2. Related work 

The high performance distributed computing
community has been researching computational grid
concepts and advancing the technology in recent years
[2]. The NASA Information Power Grid Project [3],
which is advancing grid research into robust operational
capabilities, is of particular relevance to the ASCI goal of
establishing a grid-based production supercomputing
environment. Globus [1] is a collection of infrastructure
services that can be used to construct a grid, including
services for resource discovery and information,
monitoring, security based on the Generic Security
Service Application Program Interface (GSSAPI), and
resource access. Legion [4] is an integrated distributed
object computing system that provides security, storage,
persistence, naming, and scheduling. Condor [5] finds idle
cycles on networked workstations for high throughput
applications.

Other researchers focus on software infrastructures that
build problem solving environments and tools for parallel
applications on top of grid services. Simulation Intranet
[6] and WebFlow [7] are CORBA-based environments.
Nimrod/G [8] uses Globus services directly. The use of
Java for web-based distributed systems is being pursued
extensively [9, 10].

3. Conceptual model

The ASCI grid architecture model divides needed
software services into several layers and partitions as
depicted in Figure 1. The model provides a target of the
envisioned 2004 system to support an evolutionary
development strategy. It may be abstracted as a three-tier
model with domain-specific problem solving
environments on top, grid infrastructure services in the
middle, and resource interfaces on the bottom. Security
and allocation policies must be considered at all layers of
the model.

In the top tier, problem solving environments will
present users with rich sets of tools and services pertinent
to the problem domain. Users can focus on the scientific
task, such as a product design or a parameter study,
without necessarily dealing with system details such as
data locality, resource availability, or platform-specific
commands. Developers of PSEs, applications, and tools
access the grid services layer for the management and
allocation of high performance computing (HPC)
resources.



In the middle tier, grid services provide the software
infrastructure for accessing geographically distributed
resources. These services include discovery, scheduling,
reservation, allocation, monitoring, and control of
collections of resources. Resource brokering services
match resources to user requests based on implicit and
explicit constraints. Users must receive consistent, fair,
and responsive access to resources regardless of location,
while resources must achieve high utilization.

In the bottom tier, an interface layer connects
individual resources to the grid. The resources comprising
the ASCI grid include heterogeneous computing,
communication, storage, visualization, data, and software
resources. This layer of the architecture isolates
resource-specific interfaces and commands, providing higher level
applications with consistent access mechanisms.

Security services and protection mechanisms are
needed at all layers of the model. Allocation policies are
traditionally implemented for individual resources. To
enable coordinated use of resource sets, policy
information and services must be available at higher
layers.

Figure 1. DRM layered architecture.

4. The ASCI grid 

A prototype ASCI grid has been implemented in a
limited testbed to demonstrate proof of concept. It is
currently being extended and enhanced as the
development environment for operational capabilities.
The production grid, illustrated in Figure 2, will connect
the laboratories and plants of the DOE weapons complex.

Figure 2. Planned ASCI grid.

Initial integration prototypes have interfaced grid
services to three existing problem-solving environments
shown in Figure 3. The Simulation Intranet (SI) is a
product design environment that provides web access to
simulation, analysis, and visualization tools and manages
a product-specific data repository. It is implemented in
Java and CORBA. Focusing on interactive use, the SI
environment uses the DRM "run code X anywhere"
capability to launch jobs on the most lightly loaded
suitable resource. The Integrated Design for Exploration
and Analysis (IDEA) environment provides web access to
optimization tools for simulation codes. Java-based,
IDEA was integrated with the grid monitoring services.
Nimrod/G is a parameter study tool that provides its own
brokering mechanism and interfaces directly with Globus.
It was straightforward to integrate Nimrod/G into the
ASCI grid; the same can be expected for other Globus-based
tools.




Figure 3. PSEs using prototype grid services. 

5. Services for network-centric problem 
solving 

DRM implements a Common Object Request Broker 
Architecture (CORBA) service layer on top of Globus. 
CORBA presents a standard way to componentize legacy 
or enterprise software in a distributed environment. This 
feature allows distributed deployment of work 
management and resource brokering capabilities in the 
grid. In addition, DRM implemented a desktop 
submission tool in Java for users without specialized
PSEs. The DRM component interactions are shown in
Figure 4. 

5.1. Work manager

The Work Manager hides the underlying complexity of
the grid from scientific users and PSE developers by
providing common services for resource usage. The Work
Manager is also an autonomous agent that negotiates with
the other DRM components to coordinate resource use
and executes complex tasks in the ASCI grid on the
client's behalf. The Work Manager accepts job requests,
or work specifications, from DRM clients. A work
specification is represented as an Extensible Markup
Language (XML) document and provides users and PSEs
with a grid-level "scripting" language. The work
specification can consist of computations, file migrations,
resource attributes and constraints, and task sequencing.
A DRM client application creates a Work Manager on the
server side and sends the work specification using the
submit method of the Work Manager's CORBA API. The
Work Manager generates a query to the Resource Broker,
which returns the best match. The Work Manager uses the
query results to construct a specific resource request in
the Globus Resource Specification Language (RSL).
Using the Globus Resource Allocation Manager (GRAM)
Client API, the Work Manager submits the request to the
target resource. The Work Manager receives status
updates and redirected standard out and standard error
streams from the GRAM and forwards these to the DRM
client.
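
The translation step from a work specification to an RSL request can be sketched as follows. This is only an illustration: the XML element and attribute names (work, task, constraint) are hypothetical and are not taken from the DRM schema, and the sketch handles a single task with no broker query. The output uses the RSL conjunction form, an ampersand followed by parenthesized attribute=value relations.

```python
import xml.etree.ElementTree as ET

# Hypothetical work specification. The element and attribute names are
# illustrative assumptions, not the actual DRM work specification schema.
WORK_SPEC = """
<work>
  <task executable="/usr/local/bin/alegra" arguments="-i deck.inp">
    <constraint name="count" value="32"/>
    <constraint name="maxTime" value="60"/>
  </task>
</work>
"""


def work_spec_to_rsl(xml_text):
    """Translate the first <task> of a work specification into an RSL string."""
    task = ET.fromstring(xml_text.strip()).find("task")
    relations = [("executable", task.get("executable"))]
    if task.get("arguments"):
        relations.append(("arguments", task.get("arguments")))
    for constraint in task.findall("constraint"):
        relations.append((constraint.get("name"), constraint.get("value")))
    # RSL conjunction: '&' followed by one (attribute=value) relation per item.
    return "&" + "".join(f"({name}={value})" for name, value in relations)


rsl = work_spec_to_rsl(WORK_SPEC)
print(rsl)
```

In the architecture described above, the resulting string would then be handed to the GRAM Client API; that call is omitted here because its binding depends on the Globus release in use.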

The Work Manager design supports the coordinated 
use of multiple resources. The initial prototype 
implements services for single resource requests and 
simple job management. It is currently being extended to 
support coallocation, serial, parallel, and conditional task 
sequencing. Lead-lag and deadline dependencies will be 
added as scheduling services are incorporated into the 
ASCI grid.




Figure 4. DRM's distributed object architecture.



5.2. Resource broker 

The purpose of the Resource Broker is to select the
"best" resource, allowing the user to specify only the
minimum parameters of interest. The Broker locates the
set of resources that the user can access and that satisfy the
attributes of the request. From the set of possible
resources, the Broker chooses one based on a small
number of possible selection algorithms. With a default
selection algorithm, it is possible for the user to merely
specify "run code X". This contrasts with the common
practice of selecting a machine or logging in directly,
where the user must have prior knowledge of the
resource. It also contrasts with the concept that a service
is a software component installed on a resource [6], where
again the user must have specific knowledge.

The Resource Broker first queries the grid information
service (GIS) for a set of resources that satisfy the
request. A CORBA interface to the LDAP-based GIS has
been developed to support the Broker queries. The Broker
then filters the set of resources according to the criteria in
the request. Requests can specify a particular resource
(called a "white pages" request), or can specify criteria
that the resource attributes must satisfy (called a "yellow
pages" request). The conditional operators (=, <, <=, >,
>=) can be applied to values for numeric attributes, and
the logical operators (&, |, !) can be applied to combine
attribute criteria. Any object in the GIS - a person, an
organization, software, hardware - can be brokered.

Several examples illustrate how request criteria are
applied. The request "cn=blue-mountain.asci.lanl.gov-lsf"
specifies a particular machine, the ASCI SGI platform at
Los Alamos. The request "(&(&(osname=irix)(freenodes
>=32))(!(hn=atlantis.sandia.gov)))" matches any SGI
machine (including Blue Mountain) with at least 32
available nodes, but specifically rejects Atlantis. The
request "sw=Alegra-2D" identifies all machines that can
run that code. The Broker treats the user identity as an
implicit constraint, filtering out resources where the user
is not authorized.
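
A minimal sketch of how such prefix-notation criteria might be evaluated against resource attributes is given below. It is not the Broker's implementation: the in-memory attribute records, the function names, and the treatment of multi-valued attributes are assumptions made for illustration, and a real Broker would obtain the attributes from the LDAP-based GIS.

```python
import operator
import re

# Conditional operators named in the text. Two-character forms come first
# in the pattern so that '>=' is not read as '>' followed by '='.
_COMPARE = {"<=": operator.le, ">=": operator.ge,
            "<": operator.lt, ">": operator.gt, "=": operator.eq}
_CRITERION = re.compile(r"\s*([\w-]+)\s*(<=|>=|<|>|=)\s*(.+?)\s*$")


def _split_groups(text):
    """Split '(a)(b)(c)' into its top-level parenthesized groups."""
    groups, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    return groups


def matches(request, attrs):
    """Return True if the attribute dict satisfies the request criteria."""
    body = request.strip()
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1].strip()
    if body[0] in "&|!":
        results = [matches(group, attrs) for group in _split_groups(body[1:])]
        if body[0] == "&":
            return all(results)
        if body[0] == "|":
            return any(results)
        return not results[0]
    parsed = _CRITERION.match(body)
    if parsed is None:
        raise ValueError(f"cannot parse criterion: {body!r}")
    name, op, wanted = parsed.groups()
    if name not in attrs:
        return False
    have = attrs[name]
    if isinstance(have, (int, float)):
        return _COMPARE[op](have, float(wanted))
    if isinstance(have, (list, tuple, set)):  # assumed multi-valued attribute
        return op == "=" and wanted in have
    return op == "=" and str(have).lower() == wanted.lower()


# Illustrative records; the attribute values are made up for the example.
blue = {"hn": "blue-mountain.asci.lanl.gov", "osname": "irix",
        "freenodes": 64, "sw": ["Alegra-2D"]}
atlantis = {"hn": "atlantis.sandia.gov", "osname": "irix", "freenodes": 128}
REQUEST = "(&(&(osname=irix)(freenodes>=32))(!(hn=atlantis.sandia.gov)))"
suitable = [r["hn"] for r in (blue, atlantis) if matches(REQUEST, r)]
```

Filtering on user identity, which the Broker applies as an implicit constraint, would be one more criterion combined with the request by a logical AND.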

Once a set of suitable resources is identified, the "best"
one is selected. The DRM Resource Broker will provide a
small number of selection algorithms that will serve most
requests. An initial prototype algorithm was implemented
to support interactive use by choosing the least loaded
resource. This can also provide load balancing. For each
suitable resource, the Broker calculates a load factor by
weighting the CPU loads for the previous one, five, and
fifteen-minute intervals and averaging against the CPU
clock speed. The maximum load factor corresponds to the
least loaded resource. The Broker returns to the requestor
all of the attribute/value pairs in the GIS for this resource.
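
The text does not give the weights or the exact formula for the load factor, so the following is only one plausible reading, offered as a sketch: the three load averages are combined with assumed weights, and the clock speed is divided by the result so that a larger factor means a less loaded resource. The weights, the formula, and the record fields are all assumptions.

```python
# Assumed weights for the 1-, 5-, and 15-minute load averages; the paper
# does not state the values actually used.
WEIGHTS = (0.5, 0.3, 0.2)


def load_factor(load1, load5, load15, clock_mhz):
    """Assumed form: clock speed scaled down by the weighted CPU load."""
    weighted = WEIGHTS[0] * load1 + WEIGHTS[1] * load5 + WEIGHTS[2] * load15
    return clock_mhz / (1.0 + weighted)


def select_least_loaded(resources):
    """Pick the resource with the maximum load factor."""
    return max(
        resources,
        key=lambda r: load_factor(r["load1"], r["load5"], r["load15"],
                                  r["clock_mhz"]),
    )


# Hypothetical candidates: a fast but busy host and a slower, idle one.
candidates = [
    {"name": "busy", "load1": 4.0, "load5": 4.0, "load15": 4.0, "clock_mhz": 500},
    {"name": "idle", "load1": 0.5, "load5": 0.5, "load15": 0.5, "clock_mhz": 300},
]
chosen = select_least_loaded(candidates)
```

With these numbers the idle 300 MHz host scores 200 against 100 for the busy 500 MHz host, so it is selected.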

The second selection algorithm is being prototyped for
a queuing environment by choosing the resource with the
least average wait time for a given job type. The job type
maps all possible job requests to a set of characteristic job
types for which historical average wait time data will be
maintained. Parameters that affect how long a job waits in
a queue include the resources requested, such as number
of processors and time, and the relative priority of the
request with respect to competing requests. Requests can
be prioritized according to the fair-share concept of a
service ratio (SR). The SR is implemented differently at
each site, but in all cases is a normalized measure of how
much of the resource a user has consumed relative to how
much the user is entitled to. The effects of competing
requests have been observed in cyclic patterns in the
workload characteristics for a resource, such as the
pattern of requests and behavior over a typical week of
workday and night and weekend shifts.

For this selection algorithm, the job type will consider
the following parameters: number of processors, time
requested, SR, time of day, and day of cycle. For each
parameter, the possible range of values is divided into a
set of subranges, or bins. For example, time requested
could be binned into 15-minute intervals, SR into tenths,
and time of day into 1-hour intervals. An individual job
request maps to a particular 5-tuple of parameter bins. A
historical record of the average wait time for each 5-tuple
will be maintained by the resource. The Broker will get
the SR for the user, and then the average wait time for the
job type. The average wait time will be adjusted for data
movement and non-routine planned outages. The resource
with the least predicted wait time will be selected.
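
The binning scheme can be sketched as below. The bin widths for time requested, SR, and time of day follow the examples in the text; the processor binning (by power of two), the assumption that SR lies in [0, 1], and all class and function names are illustrative choices. The adjustments for data movement and planned outages are left out.

```python
from collections import defaultdict


def job_type(processors, minutes_requested, service_ratio, hour_of_day,
             day_of_cycle):
    """Map a job request to a 5-tuple of parameter bins."""
    return (
        processors.bit_length(),           # assumed: power-of-two bins
        int(minutes_requested // 15),      # 15-minute intervals
        min(int(service_ratio * 10), 9),   # tenths, SR assumed in [0, 1]
        int(hour_of_day),                  # 1-hour intervals
        int(day_of_cycle),                 # one bin per day of the cycle
    )


class WaitHistory:
    """Running average wait time per job type for one resource."""

    def __init__(self):
        self._totals = defaultdict(lambda: [0.0, 0])

    def record(self, jtype, wait_minutes):
        entry = self._totals[jtype]
        entry[0] += wait_minutes
        entry[1] += 1

    def average(self, jtype):
        total, count = self._totals.get(jtype, (0.0, 0))
        return total / count if count else None


def select_resource(histories, jtype):
    """Return the resource name with the least average wait for jtype."""
    known = {}
    for name, history in histories.items():
        wait = history.average(jtype)
        if wait is not None:
            known[name] = wait
    return min(known, key=known.get) if known else None


# Hypothetical usage with two resources and a 32-processor, 50-minute job.
jt = job_type(32, 50, 0.35, 14.5, 2)
histories = {"a": WaitHistory(), "b": WaitHistory()}
histories["a"].record(jt, 30.0)
histories["a"].record(jt, 50.0)
histories["b"].record(jt, 10.0)
best = select_resource(histories, jt)
```

Resources with no history for a job type are skipped here; how the Broker would treat them is not stated in the text.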

The Resource Broker is being extended to support
requests for collections of resources. An algorithm for
closest match selection is also desired. When a resource
request cannot be satisfied, the closest match selection
will attempt to identify a resource that is similar to the
request.

5.3. Desktop submission tool

A desktop submission tool was developed for users 
without domain-specific PSEs. This is a Java-based GUI 
for obtaining DRM services. The desktop submission tool 
creates a work manager and sends it a work specification. 
The initial implementation allows a user to input a 
description for reference and to specify or select an 
application. Optionally, the user may input command line 
arguments, request input file staging via text entry or 
browsing, or specify resource constraints. Then the user 
may submit the job request, monitor job status, and 
display standard out and standard error streams that are 
sent back from the work manager. The tool is being
extended to support requests for new types of resources
like network bandwidth, collections of resources, and
complex task sequencing. The initial prototype is shown
in Figure 5 and is an example of a physics code running
after submission to DRM.

Figure 5. The desktop submission tool.



6. Extensions to Globus 

The ASCI environment requires extensions to Globus
services. First, the Globus Resource Allocation Manager
(GRAM) must be extended for computing resources with
unique features. This ability was demonstrated by
interfacing Globus to Cplant™. Second, monitoring
extensions display node status and job attribute
information. Third, ASCI requires Kerberos-based
security services.

6.1. Cplant™ resource interface

One of the primary reasons that the Globus toolkit was 
chosen as the foundation of the DRM architecture was its 
ability to connect to existing resource management 
systems. The ASCI grid contains both resources for which
GRAMs exist, and unique local managers that the Globus
group has not encountered before. Fortunately the design
has proven robust and accommodates these differences
through a flexible data model.

The grid data model has two basic aspects that must be
carefully designed and managed. First is the systems view
of the grid: representations of resource attributes and state
for efficient operation of grid applications. Second is the
end user view of the grid: a collection of computational
resources. These two notions are not always in agreement.
One example is the Cplant™ architecture, which has
multiple service nodes (front ends), a unique allocator
called yod, internal and external network interfaces, and a
customized version of Portable Batch System (PBS) as a
local resource manager.

Because of the size of Cplant™, the number of users
accessing it at any given time is large. Load balancing
user connections today is achieved by cycling service
node network names from a common name using the
domain name system (DNS). This works well in a login
environment since the users only have to know the
common name, but quickly breaks down in a grid
environment, as DNS will be unreliable due to
intermediate lookups. The goal is to model the resource as
a single system (the user view), while also describing the
individual service nodes to Grid applications. In Globus
this requires changes to the GRAM reporter and site-build
daemon. Coordination of reporting features will eliminate
duplicate information being written to the GIS, and
fail-over semantics will permit continued operation should a
reporting service node die.

The yod allocator assigns nodes in an order that is
efficient to the topology of its internal Myrinet
interconnect. However, as a Cplant™ system grows, the
compute nodes will not necessarily remain homogeneous.
Disparate memory sizes, network speeds, and local disk
space will determine whether a given set of nodes is
desirable to the user. To request a suitable set of nodes in
a heterogeneous system, changes to the node allocator, the
GRAM site-build and GRAM reporter daemons, and the
job manager are required.

Node allocator changes will expose the differences
between compute nodes in such a way that it is possible to
identify the varying attributes associated with each node
and to specifically request a set of nodes. This will require
that either a node identifier is bound to each compute
node in the system, or a class of nodes is defined. At the
time of job submittal, a set of identifiers or a node class is
passed to the allocator. This approach also creates
problems with the Cplant™ batch queuing system because
the scheduler will have to discriminate which jobs want
what nodes and when they are available.

The GRAM site-build daemon will have to be
modified to report node configurations. The ability to
discover node identifiers/classes and their associated
attributes will allow the site-build daemon to report the
current Cplant™ system configuration into the GIS,
making them available to the Resource Broker. The
GRAM reporter will need to report node consumption for
brokering and monitoring purposes. Currently, Globus
only reports the numbers of total and free nodes in such a
system.

Because of changes to the data model, the job manager
must know that it is running on a service node and not on
the abstract DNS definition (user view) in order to report
the correct global job id back to the GRAM client. This is
accomplished by ensuring that the domain name defined
at the service node does not reference any of the other
service nodes in the system.

The PBS installation on Cplant™ is also customized.
Therefore, it is important to incorporate native
management capabilities, most notably the pingd utility.
A hook into the GRAM reporter provides information on
Cplant™ node usage back to the GIS. In addition, because
Cplant™ PBS uses yod features, the PBS_NODES
reference in submittal scripts is currently unnecessary.

These architecture features are not unique to Cplant™. 
They exist in other ASCI resources and any complex 
installation with multiple front ends and heterogeneous 
resources. 

6.2. Monitoring services 

Monitoring services are presently supported through
two mechanisms: the Heart Beat Monitor (HBM) and the
many grid services that report status and
information to the GIS. The HBM is a collection of
small-footprint Unix daemons provided as part of the Globus
grid computing toolkit. Local monitors on each of the
grid-enabled resources and the DRM server periodically
perform the Unix ps command, parse the output, and
determine the status of the software services that have
registered as HBM clients. Each local monitor then
reports to the HBM data collector (located on the DRM
server) the status of its clients. The information reported
by the local monitors is limited to one of four states of
health. The HBM client is reported as being active,
overdue (hasn't shown up in the most recent ps output),
down, or unknown.

In contrast to the limited amount of information 
provided by the HBM, the GIS provides a wealth of 
information about the compute resource, queue status, job 
performance, and network status. Most of this information
is obtained by the GRAM reporter that runs on each 
resource for each local resource manager (fork, PBS, 
NQS, DPCS, LSF, etc.). Other resource specific 
information is reported by the site-builder, a Globus 
provided Unix daemon that reports relatively static 
resource data. End-to-end network bandwidth and latency 
are reported by gloperf, another piece of Globus software 
that provides a grid-enabled interface to netperf, an open 
source network performance monitor. 

All of the above monitoring data is available to the
user through a collection of web pages that access, parse,
and present the data using CGI scripts which run on a
secure web server. The use of a web-based interface
allows for the easy modification of how the information is
presented. Figure 6 shows an example of CGI script
customization that was used to display individual node
status of ASCI's Cplant™. Colored balls represent compute
nodes. Yellow nodes are allocated, and display job
information when the mouse cursor is placed over the
node representation.

Figure 6. Web-based node status monitoring.

6.3. Security services

The ASCI grid must support classified computing in
compliance with various Federal and DOE regulations.
The grid services must communicate securely over
multiple networks and between numerous compute
resources. In the ASCI grid, Kerberos Version 5 is the
primary mechanism for authentication. Kerberos and
several other mechanisms are also used for protecting
(encrypting) the data as it moves from grid service to grid
service. The Sandia-developed Generalized Security
Framework (GSF) will be used to provide an abstraction
layer over the Kerberos and other security libraries [11].
Both Java and C++ applications can use the GSF.

Users can connect via Secure Shell (ssh) to the DRM
server or run grid-enabled PSEs, frameworks, and
applications from their desktops. Grid-enabled software,
such as the DRM desktop submission tool, will establish
authenticated and protected connections between user
workstations and the DRM server using GSF-secured
Internet Inter-ORB Protocol (IIOP) connections. Users can
also access a secure web server to monitor the status of
grid resources and submitted jobs.

The DRM Work Managers, Resource Broker, 
Information Service, and GRAM Clients will initially all 
be located on the same server. Connections between these 
services will be authenticated using Kerberos. The GIS 
will also be located on this server. Access to the GIS data 
must be authenticated and authorized. Access to the GIS
LDAP server will be authenticated using a Netscape 
Directory Service (NDS) plug-in that provides a Simple 
Authentication and Security Layer (SASL) mechanism for 
authentication using GSSAPI over Kerberos. Access will 
be authorized using standard LDAP Access Control 
Instructions (ACIs). 

Connections between the GRAM Clients and GRAM 
Gatekeepers will almost always occur across one or more 
network connections. Authentication will be 
accomplished using GSSAPI over Kerberos. Data 
protection will be achieved by using the Secure Sockets 
Layer (SSL) protocol. The remaining GRAM processes 
will use authenticated inter-process connections. Data
transfers using the Kerberos authenticated and SSL 
protected Globus Access to Secondary Storage (GASS) 
will also be supported. 



7. Current status 

An initial ASCI grid exists in a Tri-Lab testbed, and 
prototype grid services were demonstrated at SC99. The 
initial grid services include a core Globus-based
environment for job submission and monitoring, and a 
layer of CORBA-based services, such as work 
management and resource brokering that support 
network-based problem solving methodologies. Current 
activities are focused on the security and robustness 
needed for production use. Fully integrating Kerberos into 
Globus is needed to obtain DOE security accreditation for 
use in classified networking environments. The 
robustness of the grid information service is being 
enhanced through replication and referral. DRM is 
scheduled to be deployed on ASCI White, the new 10 
Tops IBM supercomputer at LLNL, in November 2000. 

8. Future work 

DRM capabilities that provide sophisticated services 
for network-based problem solving will be added to the 
ASCI grid. Advance reservation, coscheduling, and 
workflow will allow scheduling and coordinated access to 
storage systems, network bandwidth, and visualization 
resources. A scheduler component will be added to the 
DRM architecture to realize grid-level scheduling. The 
Work Manager, Resource Broker, and GIS must also be
extended to support reservations and requests for
collections of resources. DRM will continue to
collaborate with PSEs and participate in the Grid Forum.

9. Conclusion 

DRM is developing and deploying a computational 
grid to support the ASCI mission. The Globus resource 
management infrastructure provides a highly 
sophisticated and extensible mechanism for adding and 
exploiting new and existing ASCI resources. The grid's 
CORBA layer on top of Globus aids scientists and
problem solving environments by simplifying job
management and resource discovery. All grid services
communicate securely and access to the grid information
service is access controlled. With these features, the ASCI
grid will offer improved utilization of geographically
distributed HPC resources in the Tri-Lab complex.

Abstract

Heterogeneous networks are increasingly being used as platforms for resource-intensive
distributed parallel applications. A critical contributor to the performance of such
applications is the scheduling of constituent application tasks on the network. Since
often the distributed resources cannot be brought under the control of a single global
scheduler, the application must be scheduled by the user. To obtain the best
performance, the user must take into account both application-specific and dynamic system
information in developing a schedule which meets his or her performance criteria.

In this paper, we define a set of principles underlying application-level
scheduling and describe our work-in-progress building AppLeS (application-level scheduling)
agents. We illustrate the application-level scheduling approach with a detailed
description and results for a distributed 2D Jacobi application on two production
heterogeneous platforms.

1 Introduction 

Fast networks have made it possible to coordinate distributed CPU, memory, and storage
resources to provide the potential for application performance superior to that achievable
from any single system [1]. Parallel applications targeted to such systems are typically
resource-intensive, i.e. they require more resources than are available at a single site [16].
Critical resources may include large aggregated and distributed memory, fixed data sources,
local temporary storage, and computational cycles. Performance is defined by the user,
and may mean different things for different applications; however, achieving it requires the
efficient use of all relevant resources.

Despite the performance potential that distributed systems offer for resource-intensive 
parallel applications, actually achieving the user's performance goals can be difficult. One
of the most fundamental problems that must be solved to realize good performance is the 
determination of an efficient schedule. Effective scheduling by the application developer or 
end-user involves the integration of application-specific and system-specific information, and 
is dependent on the dynamic interactions between an application and the relevant system(s). 

Currently, the performance-seeking end-user must develop schedules for distributed
heterogeneous applications off-line, using intuition to predict how the application will perform
at the time it will execute. The users or application developers must select a configuration
of resources based on load and availability, evaluate the potential performance of their
application on such configurations (based on their own performance criteria), and interact with
the relevant resource management systems in order to implement the application. At the 
same time, other users (running their own applications) draw from the same set of resources, 
each seeking to achieve his or her own performance goals. When multiple users contend for 
resources, only a fraction of the resource performance can be delivered to each. 

In this paper, we describe an application-specific approach to scheduling
individual parallel applications on production heterogeneous systems. We are
developing software to facilitate and improve upon the scheduling activities of the user. Our
goal is to develop scheduling agents that perform this task for the user at machine speeds
and with more comprehensive information. We term these agents AppLeS - Application-Level
Schedulers. Each application will have its own AppLeS to determine a
performance-efficient schedule, and to implement that schedule with respect to the appropriate resource
management systems.

Note that AppLeS is not a resource management system; rather, it interacts with systems 
such as Globus [3, 11], Legion [12, 17], or PVM [9, 20] to perform that function. As such,
AppLeS is an application-management system which manages the scheduling of the 
application for the benefit of the end-user. 

In the next subsection, we describe our approach for AppLeS. 

1.1 Scheduling from the Perspective of the Application 

Application-level scheduling is based on four underlying principles: 

• Application- and system-specific information is needed for good schedules.

Users determine good schedules for their applications based on their perception of
system capabilities, and their knowledge of the structure and requirements of their
application. The frequency of communication and computation, the amount of memory
required, the number, type, and size of application data structures are matched with
the granularity of the computational platforms, network speed and bandwidth, and
other system attributes to develop a performance-efficient schedule.

• Dynamic information is necessary to determine system state.

Users base candidate schedules on knowledge of which machines are available and 
which are heavily or lightly loaded. This load varies over time and with usage of 
system resources. If a choice of networks or computational platforms is available, the 
user will combine his/her knowledge of how the application will use the system with 
the current or predicted load on its resources. 

• Good schedules involve some prediction of application and system perfor- 
mance. 

Prediction provides the basis for most scheduling. The user predicts how their
application will execute on the system and uses this prediction to choose a
performance-efficient schedule. Such predictions are difficult to make accurately since the system
varies over time due to contention, and application performance may be dependent on
both data and system load. However, simplifying the model of the system or
application excessively to make the prediction task easier is not always fruitful. Optima for
a simplified model may not correlate with optimal behavior in practice. In particular,
application and system models must be sufficiently complex to expose real phenomena.

• All resources can be evaluated strictly in terms of the performance they 
deliver to the application. 

Notice that, from the perspective of the application (or user), each resource is judged
ultimately on how much it benefits the application's execution. Users define different 
criteria for performance (speed, cost, etc.), but the decision about which resources to 
use, and when to use them, is based on how they will perform (in terms of the specific 
criteria) when executing the user's application. 

The AppLeS approach is to use parameterizable application- and system-specific models
to predict application performance using a given set of resources. Using these models in
conjunction with forecasts of expected resource load, an AppLeS agent can select a resource
set and an application schedule by evaluating various candidate mappings. The mapping that
generates the best expected performance is chosen and implemented on the target resource
management system(s).
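
The selection loop described above can be sketched as follows. The performance model here is a deliberately simple stand-in, not the AppLeS model: each host contributes its nominal speed scaled by a forecast availability (assumed to be greater than zero), and each extra host adds a fixed communication cost. The record fields, the cost constant, and the exhaustive enumeration of candidate sets are assumptions for illustration.

```python
from itertools import combinations


def predicted_time(work_units, resources, forecast, comm_cost_per_host):
    """Toy model: aggregate forecast speed plus a per-host overhead."""
    rate = sum(r["speed"] * forecast[r["name"]] for r in resources)
    return work_units / rate + comm_cost_per_host * (len(resources) - 1)


def best_resource_set(work_units, pool, forecast, comm_cost_per_host):
    """Evaluate every non-empty subset of the pool and keep the fastest."""
    candidates = (
        subset
        for size in range(1, len(pool) + 1)
        for subset in combinations(pool, size)
    )
    return min(
        candidates,
        key=lambda s: predicted_time(work_units, s, forecast,
                                     comm_cost_per_host),
    )


# Hypothetical pool: 'busy' is nominally fast but forecast to be 90% loaded.
pool = [
    {"name": "fast", "speed": 10.0},
    {"name": "slow", "speed": 1.0},
    {"name": "busy", "speed": 10.0},
]
forecast = {"fast": 1.0, "slow": 1.0, "busy": 0.1}
chosen = best_resource_set(100.0, pool, forecast, comm_cost_per_host=2.0)
```

In this example the single fast host wins, because adding either of the other hosts costs more in communication than it gains in speed. Enumerating all subsets is only workable for small pools.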

Note that a fundamental difference between the AppLeS approach and system-oriented
schedulers is that for AppLeS, everything about the system is experienced from the
point of view of the application. If the candidate resources for the application are lightly
loaded, then the system appears lightly loaded to the application regardless of the load on
other resources. If the candidate resources are heavily loaded, then the system appears
heavily loaded.

In the next section, we utilize the application-level scheduling approach to develop an
efficient schedule for a distributed Jacobi data-parallel code. The example serves as a "proof
of concept" for the principles underlying the AppLeS approach, and serves to illuminate the
components required for general application-oriented scheduling agents. After discussing the
Jacobi example in detail, we will describe our current efforts to build general AppLeS agents
for scheduling in Section 4.

Figure 1: Five-point Jacobi Computation



2 Application-Level Scheduling of Jacobi2D 

In this section, we illustrate and motivate our approach using a simple application. We
discuss the development of an application-level schedule for a distributed 2D Jacobi application
in detail and present performance data.

Consider the problem of executing a distributed data-parallel two-dimensional Jacobi
iterative solver (Jacobi2D) using a heterogeneous network of machines. The Jacobi method
is commonly used to solve the finite-difference approximation to Poisson's equation [15] which
arises in many heat flow, electrostatic, and gravitational problems. Variable coefficients are
represented as elements of a two-dimensional grid. At each iteration, the new value of each
grid element is defined to be the average of its four nearest neighbors during the previous
iteration (see Figure 1).
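
A single five-point Jacobi iteration, as described above, can be written in a few lines. This is a serial sketch in plain Python rather than the KeLP code used in the paper; holding the boundary rows and columns fixed is an assumption about the boundary conditions.

```python
def jacobi_step(grid):
    """Return a new grid after one five-point Jacobi iteration.

    Each interior element becomes the average of its four nearest
    neighbors from the previous iteration. Boundary elements are
    copied unchanged (assumed fixed boundary values).
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


# Hypothetical 3x3 grid with a hot top edge and a single interior point.
grid = [[4.0, 4.0, 4.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
after = jacobi_step(grid)
```

Because every new value depends only on values from the previous iteration, the update is written into a copy of the grid rather than in place.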

Typically, the Jacobi computation is parallelized by partitioning the grid into rectangular
regions, and then assigning each region to a different processor. This decomposition strategy 
is favorable because a processor need only obtain the border elements for its region during 
each iteration. The amount of computational work scales as the area of each region, whereas 
the amount of delay due to communication scales as the perimeter. A small number of big 
regions will yield good processor efficiencies, but may sacrifice parallelism. Conversely, a 
large number of small regions may incur large communication overhead. In our example, the 
user wishes to identify a partitioning that yields the lowest possible execution time. Solving 
the partitioning problem optimally is NP-complete, so it is necessary for the user to employ 
heuristics to arrive at a "good" solution. 

2.1 Deriving Partitions that Optimize Resource Performance 

The version of Jacobi2D we use in this example is written in a data-parallel SPMD style
using KeLP [6, 5]. The KeLP system provides high-level abstractions, in the form of C++
objects, that support runtime data decomposition. In addition, the details associated with
message passing in distributed-memory computing environments are buried in the
abstractions, making the code portable and easy to maintain.

Figure 2: Strip data partitioning for three processors, where processor P0 is twice as fast as
processor P1 or P2.

An ideal partitioning will assign regions of the Jacobi grid to processors such that the 
area of each region matches the performance capability of the processor to which it is as- 
signed. Faster processors should compute over larger regions than slower ones. In particular, 
computational time is optimized when the ratio of each rectangular area of the grid to the
total grid area most closely matches the ratio of the power of the processor to which the
rectangular area is allocated to the total processing power available.

However, it is not simply a processor's computational time that defines its performance
capability for Jacobi2D. The performance capability of each processor depends on how fast
each processor can locally compute an element of the Jacobi matrix, and how quickly each
processor can communicate its border elements with its neighboring processors. These two 
factors most dramatically affect execution time of this application. 

To derive partitions that balance resource performance, we formulate the partitioning 
problem as an analytical model. Let 

T_i = the time for processor i to compute region i

A_i = the area of region i

P_i = the time required for processor i to compute a single point locally

C_i = the time for processor i to send and receive its borders

for each of the I regions and processors, i = 1, ..., I. The time each processor spends
computing and communicating during a single iteration of Jacobi2D can then be represented as

T_i = A_i * P_i + C_i

This equation predicts the execution time (including the time spent communicating)
for each processor. If all partitions are scheduled simultaneously, then the execution time
for a single iteration will be equal to the maximum value of T_i. We can balance the time
each processor spends computing and communicating by setting all T_i equal and solving the
resulting system of equations for A_i. For a grid with N rows and M columns, let

T_1 = T_2 = T_3 = ... = T_I
A_1 + A_2 + ... + A_I = N * M                                        (1)

We restrict the legal partitions to those which only consider a single dimension (i.e. 
strip partitions, shown in Figure 2), so that C_i does not depend on A_i. For this type of
partitioning, the system of equations (1) is linear and can be solved quickly by conventional 
methods. 

For a strip partitioning, we let 

C_i = Recv(i+1, i) + Send(i, i+1)                                  for i = 1
C_i = Recv(i-1, i) + Recv(i+1, i) + Send(i, i-1) + Send(i, i+1)    for 2 <= i <= I-1
C_i = Recv(i-1, i) + Send(i, i-1)                                  for i = I

where Recv(i, j) = time to receive N elements from processor i on processor j
      Send(i, j) = time to send N elements from processor i to processor j
      N          = number of elements in the dimension not being partitioned

We can solve the linear system of equations (1) in O(P) by simple Gaussian elimination
for each A_i. Note, however, that there is no guarantee that each A_i corresponds to an integral
number of columns (or rows). To complete the strip decomposition, we must then round the
partitions accordingly. 
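As an illustration of solving system (1) and rounding the result, the sketch below uses a closed-form rearrangement rather than Gaussian elimination: setting every T_i = A_i * P_i + C_i equal to a common time T gives A_i = (T - C_i) / P_i, and the area constraint then fixes T. The function names are hypothetical, and largest-remainder rounding is only one possible choice; the text says only that the partitions must be rounded.

```python
import math


def balanced_strip_areas(p, c, n_rows, n_cols):
    """Solve A_i * P_i + C_i = T for all i, with sum(A_i) = n_rows * n_cols.

    p[i] is the per-point compute time and c[i] the border exchange
    time of processor i. Returns (areas, t), where t is the common
    per-iteration time. Areas may be negative for infeasible inputs.
    """
    total_area = n_rows * n_cols
    inv_sum = sum(1.0 / p_i for p_i in p)
    t = (total_area + sum(c_i / p_i for p_i, c_i in zip(p, c))) / inv_sum
    areas = [(t - c_i) / p_i for p_i, c_i in zip(p, c)]
    return areas, t


def round_to_columns(areas, n_rows, n_cols):
    """Turn real-valued strip areas into whole column counts summing to n_cols.

    Assumes all areas are positive. Uses largest-remainder rounding,
    which is an assumption of this sketch, not the paper's method.
    """
    exact = [a / n_rows for a in areas]
    cols = [math.floor(x) for x in exact]
    leftover = n_cols - sum(cols)
    # Give the remaining columns to the strips with the largest fractions.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - cols[i],
                   reverse=True)
    for i in order[:leftover]:
        cols[i] += 1
    return cols
```

With three processors where the middle one is twice as fast as the others and communication is ignored, the middle strip receives twice as many columns, which matches the situation drawn in Figure 2.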

Observe that an alternative, but computationally more complex, solution is to formulate
the problem as a constraint-based minimization problem. Linear programming techniques
can then be used to derive the partitions. This approach is viable; however, in the interest
of rapid prototyping, we chose to adopt the simpler linear systems formulation.

2.2 Predicting System State with the Network Weather Service 

To solve the linear system of equations (1), we require as parameters the time required to
send and receive N elements from each processor to its neighbors (Send(i, j) and Recv(i, j)),
and the time required to compute a single element on each processor (P_i).
We can model the send and receive times as 

Send(i, j) = N * sizeof(element) / Bandwidth(i, j)
Recv(i, j) = N * sizeof(element) / Bandwidth(j, i)
where Bandwidth(i, j) = data rate supported by the link between i and j

Note that N and sizeof(element) are both time-invariant parameters of the problem being
solved. Similarly, we can model the per point compute time on each processor i as

P_i = PUnloaded_i / CPU_i, where
PUnloaded_i = the time to compute a single point on an unloaded processor i, and
CPU_i = the percentage of time processor i spends executing partition i
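The send, receive, and compute models above translate directly into code. The following sketch is illustrative only: the function names are hypothetical, processors are numbered from 0, bandwidth is assumed to be given in bytes per second as a nested list `bandwidth[i][j]`, and CPU_i is assumed to be expressed as a fraction in (0, 1] rather than a percentage.

```python
def send_time(i, j, n, element_size, bandwidth):
    """Send(i, j): time to send n elements from processor i to processor j."""
    return n * element_size / bandwidth[i][j]


def recv_time(i, j, n, element_size, bandwidth):
    """Recv(i, j): time to receive n elements from processor i on processor j.

    Follows the formula in the text, which indexes the bandwidth as (j, i).
    """
    return n * element_size / bandwidth[j][i]


def border_time(i, count, n, element_size, bandwidth):
    """C_i for a strip partitioning over processors 0 .. count-1.

    End processors exchange borders with one neighbour, interior
    processors with two.
    """
    total = 0.0
    for nb in (i - 1, i + 1):
        if 0 <= nb < count:
            total += recv_time(nb, i, n, element_size, bandwidth)
            total += send_time(i, nb, n, element_size, bandwidth)
    return total


def per_point_time(p_unloaded, cpu_fraction):
    """P_i = PUnloaded_i / CPU_i, with CPU_i given as a fraction in (0, 1]."""
    if not 0.0 < cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be in (0, 1]")
    return p_unloaded / cpu_fraction
```

A processor that receives only half of the CPU therefore appears to the model as a processor that is half as fast, which is how load enters the partitioning equations.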

These quantities will vary over time due to resource contention: Bandwidth(i, j) will be
defined (in part) by the volume and frequency of traffic crossing the link from i to j. CPU_i
will depend on the number of additional processes executing on processor i, and the way in
which each CPU is managed. Typically, if the system is time shared, the percentage of time
a CPU is devoted to any one job is some "fair share" of the total CPU time; however, that
share will change as jobs enter and leave the system.

Moreover, the estimates of Send(i, j), Recv(i, j), and P_i must be accurate at the time the
application will be scheduled, which is not necessarily the time at which the partition
is derived. The scheduler, therefore, requires a forecast of the values of Send(i, j), Recv(i, j),
and P_i for the time frame in which the application will execute.

We have developed a separate facility called the Network Weather Service which 
dynamically supplies values and forecasts for CPU_i for all i, and Bandwidth(i, j) for all i and j
in a networked system. The Network Weather Service is outlined in Section 4. For Jacobi2D,
the Network Weather Service used dynamic probes and load history to help forecast CPU_i
and Bandwidth(i, j) at the time the application was to be scheduled.

2.3 Resource Selection and Scheduling 

Resource selection focuses on the identification of a subset of resources that most efficiently
support the application. Most users naturally focus on resources they perceive as being 
"close". For the Jacobi application, we can formally define the logical "distance" between 
resources and prioritize a resource set based on this metric. Note that distance between 
resources is meaningful to the application only in terms of how the resources will
be used. Recall that for a given grid region of size N^2, the computation in each partition
scales as O(N^2) and the communication scales as O(N). We can use this relationship to
define the distance between processors for Jacobi2D. Let

P_i = the forecast time required for processor i to compute a single point locally
CE(i, j) = the forecast time for processor i to send and receive a single element to
           and from processor j

Then

D(i, j) = N^2 * |P_i - P_j| + N * (CE(i, j) + CE(j, i))

defines a distance measure between processors i and j for an arbitrary problem size N. Two
processors are near to each other in Jacobi2D if their compute capabilities are relatively
equal, and if their interprocess communication is fast.

To select resources from the global resource pool, we start by identifying a candidate
machine to serve as the locus. For example, the user's machine or the fastest machine in a 
cluster may serve as the locus. The rest of the machines are then sorted according to their 
distance D from the locus. Note that different orderings may be determined for distinct loci. 
The first K elements of the sorted list for a particular locus L are defined to be the "closest" 
resource set to L containing K machines. 
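A small sketch of the distance measure and the "closest" resource set follows. It assumes the distance is D(i, j) = N^2 * |P_i - P_j| + N * (CE(i, j) + CE(j, i)), which is our reading of the formula in the text; the function names and the list-based inputs are hypothetical.

```python
def distance(i, j, n, p, ce):
    """Logical distance between processors i and j for problem size n.

    p[i] is the forecast per-point compute time of processor i and
    ce[i][j] the forecast time for i to exchange one element with j.
    """
    return n * n * abs(p[i] - p[j]) + n * (ce[i][j] + ce[j][i])


def closest_set(locus, k, n, p, ce):
    """Return the k machines closest to the locus, nearest first."""
    others = [m for m in range(len(p)) if m != locus]
    others.sort(key=lambda m: distance(locus, m, n, p, ce))
    return others[:k]
```

Because the compute term grows as N^2 and the communication term as N, the ordering can change with problem size: a well-connected but mismatched machine may be closest for small grids, while an equally fast but poorly connected machine becomes closest for large ones.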

For Jacobi2D, the workstation with the fastest CPU was used as one such locus. We 
then used the algorithm in Figure 3 to determine a candidate resource set. 



let head = locus
let tail = locus
for i in 1 to M
    find the machine m such that D(tail, m) is a
    minimum and m is not already on the list
    add m to the tail of the list
    let tail = m
end



Figure 3: Prioritizing the resources based on "distance". 



let locus = machine having the maximum criterion value
let list = a sort of the remaining machines according to
           their logical distance
for k in 0 to I-1
    let S = {locus + the first k elements of list}
    parameterize C_i and P_i for 1 <= i <= |S| with
        Weather Service forecasts
    solve linear system of equations using this parameterization
    if (not all A_i > 0)
        reject partitioning as infeasible
    else if (there exists an A_i that does not fit in free memory
             of processor i)
        reject partitioning as infeasible
    else record expected execution time for subset S
end
implement the partitioning corresponding to the minimum
execution time using the S for which it was computed

Figure 4: Resource selection and scheduling algorithm for Jacobi2D.

The algorithm iteratively finds the machine that is closest to the current tail, and adds 
that machine at the tail end of the list. After all I machines have been added, the algorithm
terminates with each machine logically closest to those adjacent to it in the list. This form
of sorting is useful for a strip decomposition of Jacobi2D as processors only communicate 
with at most two neighbors. 
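The chained ordering of Figure 3 can be sketched as a greedy nearest-neighbour walk. The function name `order_by_chain` and the callable `dist(a, b)` are hypothetical; any distance function, such as the measure D defined earlier, can be passed in.

```python
def order_by_chain(locus, machines, dist):
    """Order machines so that each one is closest to the current tail.

    Starts at the locus and repeatedly appends the not-yet-listed
    machine with the smallest dist(tail, m).
    """
    ordered = [locus]
    remaining = [m for m in machines if m != locus]
    tail = locus
    while remaining:
        nxt = min(remaining, key=lambda m: dist(tail, m))
        remaining.remove(nxt)
        ordered.append(nxt)
        tail = nxt
    return ordered
```

Note that this differs from sorting every machine by its distance to the locus: here each step measures distance from the most recently added machine, so that neighbours in the list, which become neighbours in the strip decomposition, are close to one another.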

Having derived the resource list, the Jacobi2D scheduler then proceeds to compare different
potential partitionings using subsets of the total list. It starts by estimating the
execution time on the locus machine. Next, it considers a two processor partition using
the first two processors on the list. It parameterizes the linear system of equations for I = 2
processors, and consults the Network Weather Service for the performance forecasts that
pertain to those two machines. After solving the linear system, it records the estimated
execution time of the resulting partition. A three processor partitioning using the first three
processors from the sorted resource list is considered next. The estimated execution time
for the three processor system is recorded, and the algorithm continues until all processors
from the list are considered or some predefined maximum logical distance from the locus is
reached. Finally, a processor set and a partitioning and schedule yielding the minimum estimated
execution time are chosen as the "best" schedule for that locus. Note that when I = 1,
the Resource Selector considers a single-site implementation. In our example, the single-site 
implementation is simply a sequential version of the KeLP implementation. If an optimized 
implementation for a particular system were available, the Resource Selector could consider 
that as well. 

Each time a partition is generated in the process, it is checked for feasibility. Two filters
are employed to remove infeasible partitions from those ultimately considered for scheduling.
The first filter removes partitions that have negative values of A_i. These correspond to
mappings where the communication time is so great, the processor must compute a negative 
number of elements (implying a negative execution time) in order to finish with the other 
processors. The second filter checks to make sure that the size of each partition fits within 
the free memory (forecast by the Network Weather Service) available on the machine to 
which it is assigned. 

The resource selection and scheduling method used by our example Jacobi2D scheduler 
can be summarized by the pseudocode in Figure 4. 
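The selection loop and its two feasibility filters might look as follows in code. This is a sketch under stated assumptions, not the AppLeS implementation: `estimate(subset)` is a hypothetical callable that parameterizes and solves the linear system for a subset and returns `(areas, time)`, and memory use is modeled as `area * bytes_per_point`, which is an assumption of this example.

```python
def select_schedule(ordered, estimate, free_memory, bytes_per_point):
    """Pick the feasible prefix of `ordered` with the lowest estimated time.

    ordered: machines sorted by logical distance, locus first.
    estimate(subset) -> (areas, time) for the balanced partitioning.
    free_memory: forecast free memory in bytes, keyed by machine.
    Returns (time, subset, areas), or None if nothing is feasible.
    """
    best = None
    for k in range(1, len(ordered) + 1):
        subset = ordered[:k]
        areas, time = estimate(subset)
        # Filter 1: every region must have a positive area.
        if any(a <= 0 for a in areas):
            continue
        # Filter 2: every region must fit in its machine's free memory.
        if any(a * bytes_per_point > free_memory[m]
               for a, m in zip(areas, subset)):
            continue
        if best is None or time < best[0]:
            best = (time, subset, areas)
    return best
```

Passing the solver in as a callable keeps the loop independent of how the partition sizes are computed, so the same skeleton could be reused with a different performance model.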

2.4 Scheduling Jacobi2D and the AppLeS PrincipLeS 

The scheduling approach we have described for Jacobi2D uses the principles outlined in 
Section 1.1 and in fact is an example of an AppLeS. Application-specific and system-specific 
information are used throughout the scheduler, both to generate schedules and to select 
resources. Dynamic system information is provided via the Network Weather Service to 
parameterize performance models. Predictive models are used to evaluate and rank candidate 
schedules. Finally and perhaps most important, all resources are considered strictly in terms 
of how they affect application performance. 

Using this application-level approach to scheduling, the natural question becomes "How 
performance-efficient is the schedule that it generates?" We describe experiments which
address this question in the next section. 






3 Performance Results for the Jacobi2D Application- 
Level Schedule 

To determine the effectiveness of the application-level scheduling approach, it is important 
to answer the following questions: 

• How does the execution time of Jacobi2D using an AppLeS schedule compare to a
schedule determined using a widely-accepted conventional method?

• What is the effect of using dynamically forecast resource performance data in the
application-level scheduling approach?

• What is the effect of automatic resource selection in the application-level scheduling
approach?

To address these questions, we compared four partitioning methods for the same KeLP
implementation of Jacobi2D. The first method (Compile-time Blocked) uses a conventional
HPF-style [14] block partitioning in which each processor is assigned (at compile-time)
a relatively equal-sized square region of the grid to compute. The other three partitioning
methods utilize versions of the application-level scheduling approach described in the
previous section. Partitioning method 2 (Compile-time AppLeS) uses good static estimates
for resource performance and uses resource selection to select a resource set from the total
resources. Partitioning method 3 (Runtime AppLeS/No Select) uses dynamic estimates
from the Network Weather Service for resource performance but assumes that the user wants
to use all available resources. Partitioning method 4 (Runtime AppLeS) uses dynamic
estimates and resource selection; it constitutes the full application-level scheduling approach
discussed in the last section. Note that partitioning methods 3 and 4 utilize Network Weather
Service data and so must be performed at run-time, whereas partitioning methods 1 and 2
use static data and may be performed at compile-time.

All four versions first partition and distribute the grid, and then execute the Jacobi solver. 
That is, the data and computations are scheduled on the processors once before execution
begins and remain there for the duration of the execution. We are currently formulating
a version of the Jacobi application-level scheduler which effectively redistributes the grid in
response to changing load on system resources; this flexibility is supported in the AppLeS
software described in the next section. 



3.1 Execution Performance 

To investigate the relative execution performance of the four partitioning methods, we used
eight non-dedicated workstations located at the San Diego Supercomputer Center (SDSC)
and the U.C. San Diego Parallel Computation Laboratory (UCSD-PCL). The workstation
set consisted of a Sun Sparc-2, a Sun Sparc-10, and two IBM RS6000 workstations located
at UCSD and four DEC Alpha workstations located at SDSC. Numeric format conversions
were handled by KeLP, which uses MPI as its underlying communication substrate. The
network connecting these systems was also heterogeneous and non-dedicated. Within the
PCL, the Suns were attached to an ethernet segment shared by several other systems. The
RS6000s were connected to a different segment (also shared by other ambient machines) and
a gateway which linked the two segments. At SDSC, the Alpha workstations were connected
to a non-dedicated FDDI ring. The configuration is shown in Figure 5.







Figure 5: Workstations and Networks used at UCSD and SDSC. 



All systems and networks were shared and used in "production mode" while we ran our 
experiments. Since conditions might change between one execution and the next (due to 
contention) we made several runs for each problem size, and reported the average execution 
time of a single iteration. During each experiment, we ran one instance of each of the 
four partitioning methods back-to-back hoping that all four executions would enjoy similar
conditions, on average. Figure 6 shows the average iteration execution times (in seconds)
for a range of problem sizes. In each case, a square grid having the problem size dimension
shown in the figure was used.

In the experiments, application-level scheduling is able to outperform the block partition- 
ing because it uses its performance model to predict how well each resource will perform 
when executing a piece of Jacobi2D. It uses that prediction to determine how much of the 
grid should be assigned to each machine. Notice also, that the benefit gained from using 
dynamic performance forecasts is substantial. Less obvious, however, is the improvement 
gained through resource selection. While the version that used resource selection does run 
between 25% and 50% faster than the non-selecting runtime AppLeS, the relative improve- 
ment compared to the blocked implementation is not large. However, the range of feasible 
partitions for the non-selecting runtime AppLeS is limited. For example, under the con- 
ditions during which the experiments were conducted, it was not possible to balance the 
execution time for a 500 by 500 element problem: the communication delay between UCSD 
and SDSC was so great that processors at either end would need to compute for a negative
amount of time to compensate. 



In Figure 7 we show execution time data for a wider range of problem sizes using Compile-time
Blocked and the full AppLeS partitioner. Without resource selection, AppLeS would
only be able to compute reliably (depending on contention conditions) over the 1000 to 2000 
problem size domain. We also show the predicted execution time AppLeS computed for 
each run. For each problem size, we plot the time that the performance model predicted
against the actual execution time that resulted for each mapping. It is the accuracy of the 
performance model that allows AppLeS to choose good resource mappings. 



Comparison of Execution Times




Figure 7: Execution times for Jacobi2D. 



Note also the large spike in execution time for the blocked partitioning at the 1900 
problem size abscissa. During one experimental run at that size, a network gateway be- 
tween UCSD and SDSC went down forcing all communications between the two to use an 
alternative and much slower route. The AppLeS agent (through Network Weather Service 
readings) was able to detect the sudden drop in available bandwidth and avoid partitionings 
that spanned the affected link. 

3.2 Partitioning for Memory Availability 

Distributed parallel execution also allows an application to aggregate memory resources so 
that problems that are larger than will fit into any single memory may be solved. Indeed,
the motivation behind the parallel implementation of many codes stems from the need to use
collections of memory systems rather than a desire for concurrent execution. To investigate
the ability of the AppLeS approach to effectively aggregate memory, we added to the resource
pool two IBM SP-2 processors with 128 megabytes of real memory each. The SP-2 uses
virtual memory on each of its nodes so that more than 128 megabytes of memory may be
used. However, memory is paged to disk, causing reference times to increase dramatically
when the real memory of the system is exceeded. During the experiments, we had dedicated
access to the two SP-2 processors and the link between them, but they were connected to
the rest of the resources via a shared ethernet segment. Figure 8 shows the resource pool
including the SP-2 nodes.







Figure 8: Resource Pool Including SP-2 Processors. 



Since the processors were completely unloaded, and their connectivity to the other re- 
sources suffered from contention, the best partitioning (yielding the shortest execution time) 
was to split the grid evenly between the two SP-2 nodes as long as neither partition exceeded 
the available real memory on each node. However, when the problem size caused the par- 
titions to spill out of the available real memory, the resulting delays due to paging caused 
execution time to increase substantially. In Figure 9 we show the execution time of a blocked 
partitioning using the SP-2 processors only versus the AppLeS approach for Jacobi2D. 

For problem sizes less than 3900 by 3900, AppLeS correctly chose the mapping using the 
SP-2 processors and exhibited nearly identical execution times to the blocked mapping. As 
problem size increased, the SP-2 began paging, causing execution time to increase to the
point where use of these processors was no longer feasible. The AppLeS agent was able to
locate memory elsewhere within the resource pool effectively. At each problem size beyond
3900, the AppLeS was able to find memory it could use effectively without a dramatic change
in the performance trajectory. 

Thus far we have shown how the AppLeS approach was used effectively to determine 
a performance-efficient (and non-obvious) schedule for Jacobi2D. It was important to walk
through this example in detail to demonstrate this approach. We now discuss how the
AppLeS approach can be used as the basis for the design of general software agents which
facilitate application-level scheduling for distributed parallel applications. 

4 Developing General AppLeS Agents

It is clear from the previous sections that application-level scheduling can be used effectively 
to achieve performance for distributed applications. However, to develop general AppLeS 
agents, we must convince ourselves that the following questions could be answered in the
affirmative:

• Is the application-level approach for selecting a performance-efficient schedule
generalizable?

• Is it possible to efficiently obtain the appropriate level of application and system infor- 
mation (from the user or through analysis) from which good schedules may be derived? 

To address the first question, observe that in the development of the application-level 
schedule for Jacobi2D, our approach did not rely particularly on the choice of algorithm,
implementation language, or programming style for success. The organization of the AppLeS
software mimics how a diligent user would schedule his or her application. The characteristics
of the application are relevant only as they pertain to modeling its performance. In AppLeS,
we modularize application-specific, system-specific and dynamic information and use this
information to parameterize the general approach. 

To address the second question, we developed a set of data sources to provide the relevant
application- and system-specific information efficiently. The Network Weather Service
was designed to provide dynamic system information and short-term forecasts. Application- 
specific information is provided through a Heterogeneous Application Template (or HAT) 
which distills much of the information from the application relevant to performance estima- 
tion. Additional information which reflects the user's preferences, access to resources, etc. 
is provided by User Specifications. Note that for AppLeS, as in practice, the more complete 
the application information that is available to the scheduler, the better the schedule. 

AppLeS is currently a work-in-progress. The software has been designed and the under- 
lying building blocks are currently being prototyped. We are working with researchers from 
the Legion project [12], [17] and from the Globus project [3], [11] to prototype AppLeS as 
an application-level scheduler for these resource management systems. In addition, we are 
progressing on an implementation which uses MPI as the underlying substrate. 

Note that AppLeS essentially develops a customized scheduler for each application. This
differs from the approach taken in much of the scheduling literature ([21], [13], [19], [23], [7],
etc.). Application-level scheduling is related to the work of Brewer [2], and more directly to
the MARS project [8]. Brewer's work, which attempts to select the correct implementation of
an algorithm for a given machine based on a small set of static parameters, uses
application-specific information to improve performance. The MARS project [8], whose goal is to produce
more general-purpose software, is more similar in scope and intent to AppLeS. An important
difference, however, is that AppLeS includes user-specific as well as application-specific 
information in its scheduling decisions. User-specific information provides a powerful and 
well-defined interface that allows the user to influence and control how the scheduling agent 
will behave. 

In the following sections, we describe the architecture for general AppLeS agents.

4.1 The AppLeS Organization

AppLeS is organized in terms of four subsystems and a single active agent called the Coordinator.
The four subsystems are

• The Resource Selector which chooses and filters different resource combinations for 
the application's execution, 

• The Planner which generates a description of a system-independent schedule from a
resource combination,

• The Performance Estimator which generates a performance estimate for candidate 
schedules according to the user's performance metric, and 

• The Actuator which implements the "best" schedule on the target resource manage- 
ment system(s). 

Figure 10 depicts the Coordinator and these four subsystems. Application-specific,
system-specific, and dynamic information used by these subsystems constitute an "information pool"
which all subsystems share. There are four general sources of information feeding the
information pool. The Network Weather Service provides dynamic information on system
state and forecasts of system state for the time frame in which the application will be sched- 
uled. The Heterogeneous Application Template is a web-oriented interface in which 
the user provides specific information about the structure, characteristics and current imple- 
mentations of the application and its tasks. The User Specifications provide information 
on the user's criteria for performance, preferences for implementation, additional application 
information, etc. Finally, the Model pool provides model templates used by the AppLeS 
subsystems for application performance estimation. 

AppLeS agents will be employed as follows: Initially, the user provides information to 
the agent via the HAT and User Specifications. The agent uses the Resource Selector to 
select a set of viable resource configurations based on accessibility, the user's access rights, 
the characteristics of the application (input as filters which exclude resources that are not 
viable), and a notion of "distance" which is derived from HAT information and the Model 
pool, or provided as a default by the Coordinator. For each viable resource configuration, the 
Planner (in conjunction with the Performance Estimator and the Network Weather Service)
computes a potential schedule of the resources using predictive models from the Model pool.
The Coordinator considers the performance of the candidate schedules and selects a "best"
schedule for implementation. The Actuator then interacts with the resource management
system(s) to implement this schedule.

Figure 10: Relationship of the components of AppLeS.
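The flow of control just described can be summarized in a few lines. The sketch below is purely illustrative: the function `run_agent` and its callable parameters are hypothetical stand-ins for the Resource Selector, Planner, Performance Estimator, and Actuator, and it assumes that a lower estimate is better (for example, estimated execution time).

```python
def run_agent(resources, select, plan, estimate, actuate):
    """Drive one scheduling pass through the four subsystems.

    select(resources) -> list of viable resource configurations
    plan(config)      -> candidate schedule for a configuration
    estimate(sched)   -> performance estimate (lower is better here)
    actuate(sched)    -> implement the chosen schedule
    Returns the chosen schedule, or None if no configuration is viable.
    """
    scored = []
    for config in select(resources):
        schedule = plan(config)
        scored.append((estimate(schedule), schedule))
    if not scored:
        return None
    # The Coordinator's choice: the candidate with the best estimate.
    _, best = min(scored, key=lambda pair: pair[0])
    actuate(best)
    return best
```

Keeping each subsystem behind a simple callable mirrors the modular organization described above, in which application-specific and system-specific information parameterize a general approach.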

In the following subsections, we describe each of the components of AppLeS agents in 
more detail. 

4.2 The Coordinator 

The Coordinator embodies the active thread or threads of control within an AppLeS agent. 
It executes a blueprint that dictates the way in which it uses the various other subsystems 
to derive a schedule, initiate the application, and monitor its progress. The blueprint can be 
specified by the user or by the system for a particular application or class of applications (e.g. 
data parallel applications). We show a sample blueprint in Figure 11. This is typical for a 
user scheduling a minimum execution time application over a large set of possible resources
and is, in fact, the blueprint used to schedule Jacobi2D.

Figure 11: Coordinator and Blueprint

4.3 Data Sources

While the Coordinator directs the interactions between subsystems through its blueprint, 
each subsystem draws upon a variety of data sources to perform its function. These data 
sources contribute information to a data pool which is available to all AppLeS functions
(see Figure 10). They are the Network Weather Service (NWS), the Heterogeneous 
Application Template (HAT), the User Specifications, and the Model pool. In this 
section, we briefly describe the form and content of each. 

4.3.1 The Network Weather Service 

The Network Weather Service provides software for monitoring and predicting the load (or 
"weather") on networked resources. Our strategy is to use sensors to dynamically probe and 
read the network "weather" conditions such as CPU load, available free memory, network 
performance, etc.

To provide forecasts of system state, the Network Weather Service uses a number of
stochastic techniques for predicting network load. Experiments using different network links
and predictors show that, in general, for a given resource, different estimation techniques will
yield the best forecasts at different times. Consequently, the Network Weather Service tracks
the error between all predictors and sampled data, and uses the predictor with the lowest 
cumulative error to make predictions of system state. Both the prediction and a measure of 
its recent "accuracy" are used by the Resource Selector, the Planner, and the Performance 
Estimator subsystems of an AppLeS agent. 
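The idea of tracking every predictor's error and forecasting with the current best one can be sketched as follows. This is not the Network Weather Service code: the class name, the two toy predictors (last value and running mean), and the use of cumulative absolute error are all assumptions made for illustration.

```python
def last_value(history):
    """Predict that the next measurement equals the most recent one."""
    return history[-1]


def running_mean(history):
    """Predict the mean of all measurements seen so far."""
    return sum(history) / len(history)


class ForecasterPool:
    """Track cumulative error per predictor and forecast with the best."""

    def __init__(self, predictors):
        self.predictors = predictors          # name -> function(history)
        self.errors = {name: 0.0 for name in predictors}
        self.history = []

    def observe(self, value):
        """Score each predictor against the new sample, then record it."""
        if self.history:
            for name, predict in self.predictors.items():
                self.errors[name] += abs(predict(self.history) - value)
        self.history.append(value)

    def forecast(self):
        """Return (predictor name, forecast, cumulative error).

        Call only after at least one observation has been recorded.
        """
        best = min(self.errors, key=self.errors.get)
        return best, self.predictors[best](self.history), self.errors[best]
```

Returning the cumulative error alongside the forecast gives callers a rough indication of how much to trust the prediction, in the spirit of the "accuracy" measure mentioned above.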

We have prototyped this facility with good results as shown in the Jacobi2D example.
We are currently integrating Network Weather Service facilities with the Legion and Globus 
resource management systems. 

4.3.2 The Heterogeneous Application Template 

The Heterogeneous Application Template (HAT) provides basic information about the overall
application, tasks and implementations in terms of their resource requirements. Information
is provided through a web interface which makes explicit the structural parameters
of the application, information about existing implementations of application tasks, and the
data movement requirements between distinct tasks. Figures 12, 13 and 14 give a sample of
HAT parameters.

The HAT also lets the user identify an active set, i.e. a set of task/machine implementations
that work together to compose an entire application. Since there may be multiple
implementations, the active set identifies the particular task/machine allocations that will
be used in a single full implementation of the application. For Jacobi2D, the active set was
composed of a single task implementation per machine. In general, however, there may be
several implementations from which to choose and multiple active sets.

Notice that the user may not have all the information requested by HAT. The system
can use partial information to determine a schedule. However, as is the case for the user,
the better and more comprehensive the information available, the more performance-efficient 
the schedule is likely to be. 

4.3.3 User Specifications

While the HAT describes application-specific information, information specific to a particular 
user or application developer is made available to an AppLeS through User Specifications
(US), which will also be HTML-based. The most important role of the US is in the definition of
user-specified requirements, which fall into three broad categories: execution constraints,
performance objectives, and user preferences. Execution constraints refer to the access 
rights and resource constraints of the user. The user's performance objective is also 
conveyed through the US. For Jacobi2D, minimum execution time was the desired objective. 
Finally, the US allows the user to specify preferences for the Coordinator to attempt to 
satisfy. It may be that one resource should be preferred over another for reasons unrelated
to performance. This feature gives the user tremendous control over the actions of AppLeS
and the solutions that it generates. 

4.3.4 Models

The Model pool provides a set of model templates which are used for application performance
estimation by the Planner, Performance Estimator and Resource Selector. Model templates
are structures for composing models of characteristics which contribute to application
performance. For example, in Jacobi, the model template for the execution time for processor
i is

T_i = Computation + Communication

where Computation is instantiated as A_i * P_i and Communication is instantiated as C_i, as
described in Section 2.1.

Model templates may be provided by the user. Default model templates for classes of 
applications (e.g. data parallel regular grid applications) will be available in the Model pool.
Note that model templates can leverage successful models from the literature such as [2],
[18], [10], [24], [22], etc. to predict the performance of the application and its tasks.

4.4 Resource Selector 

The Resource Selector produces viable active sets to be considered by the Coordinator. It 
may iterate multiple times to identify a set of candidate active sets according to different
selection criteria. 

A potentially viable active set may be filtered to ensure its feasibility. Resources are 
prioritized with respect to an application-specific valuation function such as "distance", and 
filters are applied to the resource set to eliminate resources that are unusable. A filter 
may use information such as the user's access rights, memory constraints, implementation
availability, etc. to eliminate resources quickly. Viable and feasible resource configurations 
will be scheduled by the Planner, evaluated by the Performance Estimator, and compared 
by the Coordinator to other candidate schedules. 

In the Jacobi2D example, filters considered two characteristics of each potential schedule:
the area of region i, A_i, and the available memory. Partitions with strips in which A_i
was negative were filtered out. Next, resources which did not meet the memory requirements
of application tasks were also filtered out. Such constraints for most users are readily
identifiable, and can be used profitably to reduce the resource selection space.
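
The prioritize-then-filter step can be sketched as below. This is an illustrative sketch only: the `Resource` fields, the use of a numeric "distance" as the valuation function, and the two filters are assumptions made for the example, not the AppLeS data structures.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    accessible: bool        # whether the user has access rights (assumed field)
    free_memory_mb: float   # memory currently available (assumed field)
    distance: float         # application-specific valuation; lower is better

def select_resources(resources, required_memory_mb):
    """Rank resources by the valuation function, then drop unusable ones."""
    ranked = sorted(resources, key=lambda r: r.distance)
    filters = [
        lambda r: r.accessible,                            # access rights
        lambda r: r.free_memory_mb >= required_memory_mb,  # memory constraint
    ]
    # A resource survives only if every filter accepts it.
    return [r for r in ranked if all(f(r) for f in filters)]
```

Because each filter is a cheap predicate, unusable resources are eliminated before any schedule has to be planned or estimated for them.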

4.5 Planner 

The function of the Planner is to create a schedule for a feasible active set. The schedule is 
based on a scheduling policy that optimizes for the user's performance measure. In practice, 
most users will employ common performance measures (execution time, cost, speedup), and 
the Planner will be equipped with default scheduling policies for these measures if the user 
chooses not to recommend a policy of his/her own. The schedule generated by the Planner 
must be in a format that the Actuator (described in section 4.7) can implement on the target 
resource system. 

In the Jacobi2D example, the Planner implemented a time-balancing scheduling policy. 
It took a list of candidate machines and their communication links (the feasible resource 
set), and produced a mapping of grid strips to the machines. The Coordinator then used
the Performance Estimator to determine the execution time of each mapping generated by
the Planner and passed the best schedule to the Actuator.
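
A time-balancing policy of this kind can be sketched as follows. The sketch assumes the Jacobi model T_i = A_i * P_i + C_i, with P_i taken to be the time processor i needs per unit of area and C_i its communication time, and it chooses strip areas so that every T_i is equal. The function name and the closed-form solution are illustrative assumptions, not the prototype's code.

```python
def time_balance(total_area, p, c):
    """Choose areas A_i so that A_i * p[i] + c[i] is the same time T for all i.

    total_area: area of the whole grid (the A_i must sum to it)
    p: per-unit-area compute time of each machine (all values > 0)
    c: communication time of each machine
    Returns (T, [A_0, A_1, ...]).
    """
    # From A_i = (T - c_i) / p_i and sum(A_i) = total_area:
    #   T = (total_area + sum(c_i / p_i)) / sum(1 / p_i)
    t = (total_area + sum(ci / pi for ci, pi in zip(c, p))) / sum(1.0 / pi for pi in p)
    areas = [(t - ci) / pi for ci, pi in zip(c, p)]
    return t, areas
```

A negative area in the result means that a machine's communication time alone exceeds the balanced time; that is the kind of partition the Resource Selector filters out.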

4.6 The Performance Estimator

The performance estimator parameterizes a model template with component models to produce
an estimate of application performance, given a schedule provided by the Planner.
Parameters for the component models can be provided by the user or derived from other
data sources such as the Network Weather Service. Since dynamic information is included,
the resulting estimates can be targeted to the time frame during which the application will
be run by the Actuator. In Jacobi, the formula T_i = A_i * P_i + C_i is evaluated to obtain an
estimate of the time necessary to compute each strip.

Note that it is important to estimate the behavior of the application tasks in the context 
of the production systems in which they will be used. For this reason, we are developing
models which forecast the slowdown of tasks on shared resources (networks and machines) 
[4]. Factoring slowdown into the model will provide a more realistic estimate of application 
and task performance in the presence of contention. 

4.7 Actuator

AppLeS does not function as a resource manager - it relies on the services of existing resource 
managers to perform resource allocation and task instantiation. It is the job of the Actuator
to implement the schedule (determined by the Planner) using the semantics and facilities
supported by the target resource management system. Some of these resource managers, 
such as PVM, are limited in scope and provide little additional functionality. Others, like 
Legion, have the potential for communicating considerable information about resource and 
application status. The Actuator will also convey whatever feedback information is available 
to the various subsystems. It acts as the conduit between the Coordinator and the underlying 
resource management facilities. 

The minimum functionality required by the Actuator is the ability to initiate a network-connected
task on a remote machine. More accurate scheduling can be accomplished when
the resource management system returns feedback about when resources are actually available
for use, or can provide guaranteed service times in response to requests for service. Since
the AppLeS agent is working at the application level, however, the Actuator minimally has
access to whatever facilities the application enjoys. It will use the same facilities to communicate
with the application and manage its task execution that the application itself uses to
control its tasks. In that sense, the Actuator, and by extension the AppLeS agent, constitute
an integrated extension of the program being scheduled. AppLeS and the application
become part of the same execution instance. In the Jacobi2D example, the Actuator issued 
KeLP directives to control grid partitioning. These were the same primitives the application 
used to manage the grid itself. 

5 Summary



As network speeds increase and parallel distributed computing becomes more prevalent,
resource-intensive applications will increasingly need to leverage shared, heterogeneous networks
of resources. Effective coordination of application components and their use of resources
is key to performance. In this work, we described application-level scheduling
as a way of achieving performance-efficient schedules for applications which execute on heterogeneous
networks of machines. We described principles which reflect the way in which
applications are scheduled by their end-users and illustrated these principles by developing
a "proof-of-concept" application-level scheduler for a distributed data-parallel Jacobi application.
We then described a general architecture for Application-Level Schedulers and
described the subsystems which compose an AppLeS agent.

From the results generated by our prototype, it is clear that the AppLeS approach can
achieve substantial performance improvements for an individual application over conventional
scheduling methods. Application-level scheduling allows the user to deal with the
heterogeneous system as it really is: under the control of multiple system schedulers, shared 
by other contending applications, and able to deliver only a dynamically varying fraction of 
resource performance. When such characteristics are explicitly factored into the scheduling 
activity, the application can better leverage the system to achieve performance. 

[Figure 12: screenshot of the HAT Structure Template web form, with fields for APPLICATION, USER, INPUT (amount of data needed to start the application, in MBytes, and its current source), OUTPUT (amount of data returned by the application, in MBytes), and ITERATION PHASE.]

Figure 12: The Structure module of HAT gives information about the general functional
decomposition of the application, and lets a user identify an active set for the application.

[Figure 13: screenshot of the HAT Implementation Template web form, with fields for TASK, PLATFORM, PARADIGM (sequential, vector, task parallel, data parallel; single processor or multi-processor), USAGE (dedicated or non-dedicated), DATA STRUCTURES (size, computation and communication per data structure), RATIO (communication heavy, balanced, computation heavy), COMMUNICATION PATTERNS (point-to-point, stencil, multicast, broadcast), MEMORY, and TUNING FACTOR.]

Figure 13: The Implementation module focuses on how the task was implemented for a
specific platform.

[Figure 14: screenshot of the HAT Interface Template web form, with fields for IMPLEMENTATION A, IMPLEMENTATION B, NETWORK, COMMUNICATION FREQUENCY, AMOUNT OF COMMUNICATION (total and per message, in MBytes), DATA CONVERSION (format or structure conversion), and PIPELINE.]

Figure 14: The Interface module of HAT characterizes the communication between implementations
A and B mapped to distinct execution sites.

Fundamentals of Grid Computing



The purpose of this IBM Redpaper is to provide discussion material about grid computing,
concepts, use, and architecture. Grid computing represents unlimited opportunities in terms
of business and technical aspects. The audience for this paper is all hungry minds looking
for a collection of facts and data about this new and exciting realm.

The following major topics will be introduced to the readers: 

► What grid computing can do
► Grid concepts and components
► Grid construction
► The present and the future
► What the grid cannot do

Grid computing, most simply stated, is distributed computing taken to the next evolutionary
level. The goal is to create the illusion of a simple yet large and powerful self-managing virtual
computer out of a large collection of connected heterogeneous systems sharing various
combinations of resources.

The standardization of communications between heterogeneous systems created the Internet
explosion. The emerging standardization for sharing resources, along with the availability of
higher bandwidth, is driving a possibly equally large evolutionary step in grid computing.



What grid computing can do

When you deploy a grid, it will be to meet a set of customer requirements. To better match
grid computing capabilities to those requirements, it is useful to keep in mind the reasons for
using grid computing. This section describes the most important capabilities of grid
computing.

Exploiting underutilized resources

The easiest use of grid computing is to run an existing application on a different machine. The
machine on which the application is normally run might be unusually busy due to an unusual
peak in activity. The job in question could be run on an idle machine elsewhere on the grid.

There are at least two prerequisites for this scenario. First, the application must be
executable remotely and without undue overhead. Second, the remote machine must meet
any special hardware, software, or resource requirements imposed by the application.

For example, a batch job that spends a significant amount of time processing a set of input
data to produce an output set is perhaps the most ideal and simple use for a grid. If the
quantities of input and output are large, more thought and planning might be required to
efficiently use the grid for such a job. It would usually not make sense to use a word
processor remotely on a grid because there would probably be greater delays and more
potential points of failure.

In most organizations, there are large amounts of underutilized computing resources. Most
[...] organizations, even the server machines can often be relatively idle. Grid computing
provides a framework for exploiting these underutilized resources and thus has the possibility
of substantially increasing the efficiency of resource usage.

The processing resources are not the only ones that may be underutilized. Often, machines
may have enormous unused disk drive capacity. Grid computing, more specifically, a "data
grid", can be used to aggregate this unused storage into a much larger virtual data store
[...] improved performance and reliability over that of any single [...] automatically repeated
[...] on a remote machine in the grid, the data is already there and does not need to be moved
to that remote point. This offers clear performance benefits. Also, such copies of data can be
used as backups when the primary copies are damaged or unavailable.

Another function of the grid is to better balance resource utilization. An organization may
have occasional unexpected peaks of activity that demand more resources. If the
applications are grid enabled, they can be moved to underutilized machines during such
peaks. In fact, some grid implementations can migrate partially completed jobs. In general, a
grid can provide a consistent way to balance the loads on a wider federation of resources.
This applies to CPU, storage, and many other kinds of resources that may be available on a
grid. [...] larger organization, permitting better planning when upgrading systems,
increasing capacity, or retiring computing resources no longer needed.



Parallel CPU capacity 



The potential for massive parallel CPU capacity is one of the most attractive features of a
grid. [...] such computing power is driving a new evolution in
industries such as the bio-medical field, financial modeling, oil exploration, motion picture
animation, and many others.

[...] applications have been written to use
algorithms that can be partitioned into independently running parts. A CPU-intensive grid
application can be thought of as many smaller "subjobs," each executing on a different
machine in the grid. To the extent that these subjobs do not need to communicate with each
other, the more "scalable" the application becomes. A perfectly scalable application will, for
example, finish 10 times faster if it uses 10 times the number of processors.

[...] scalability. The first barrier depends on the algorithms used for
[...] If the algorithm can only be split into a limited
number of independently running parts, then that forms a scalability barrier. The second
barrier appears if the parts are not completely independent; this can cause contention, which
can limit scalability. For example, if all of the subjobs need to read and write from one
common file or database, the access limits of that file or database will become the limiting
factor in the application's scalability. Other sources of inter-job contention in a parallel grid
application include message communications latencies among the jobs, network
communication capacities, synchronization protocols, input-output bandwidth to devices and
storage devices, and latencies interfering with real-time requirements.

Applications 

There are many factors to consider in grid-enabling an application. One must understand that 
not all applications can be transformed to run in parallel on a grid and achieve scalability. 
Furthermore, there are no practical tools for transforming arbitrary applications to exploit the 
parallel capabilities of a grid. There are some practical tools that skilled application designers 
can use to write a parallel grid application. However, automatic transformation of applications 
is a science in its infancy. This can be a difficult job and often requires top mathematics and 
programming talents, if it is even possible in a given situation. New computation-intensive
applications written today are being designed for parallel execution and these will be easily
grid enabled, if they do not already follow emerging grid protocols and standards.

Virtual resources and virtual organizations for collaboration 

Another important grid computing contribution is to enable and simplify collaboration among a 
wider audience. In the past, distributed computing promised this collaboration and achieved it 
to some extent. Grid computing takes these capabilities to an even wider audience, while 
offering important standards that enable very heterogeneous systems to work together to
form the image of a large virtual computing system offering a variety of virtual resources, as
illustrated in Figure 1 on page 4. The users of the grid can be organized dynamically into a
number of virtual organizations, each with different policy requirements. These virtual
organizations can share their resources collectively as a larger grid.

Sharing starts with data in the form of files or databases. A "data grid" can expand data 
capabilities in several ways. First, files or databases can seamlessly span many systems and
thus have larger capacities than on any single system. Such spanning can improve data 
transfer rates through the use of striping techniques. Data can be duplicated throughout the 
grid to serve as a backup and can be hosted on or near the machines most likely to need the 
data, in conjunction with advanced scheduling techniques. 

Sharing is not limited to files, but also includes many other resources, such as equipment, 
software, services, licenses, and others. These resources are "virtualized" to give them a
more uniform interoperability among heterogeneous grid participants.

The participants and users of the grid can be members of several real and virtual 
organizations. The grid can help in enforcing security rules among them and implement 
policies, which can resolve priorities for both resources and users. 

Access to additional resources

In addition to CPU and storage resources, a grid can provide access to increased quantities
of other resources and to special equipment, software, licenses, and other services. The
additional resources can be provided in additional numbers and/or capacity.

[...] bandwidth to the Internet to implement a
data mining search engine, the work can be split among grid machines that have independent
[...] machine has a separate connection to the Internet. If the machines had shared the
connection to the Internet, there would not have been an effective increase in bandwidth.

Some machines may have expensive licensed software installed that the user requires. His
jobs can be sent to such machines, more fully exploiting the software licenses.

[...] Most of us have used remote printers,
perhaps with advanced color capabilities or faster speeds. Similarly, a grid can be used to
make use of other special equipment. For example, a machine may have a high speed [...]
on the grid may be connected to scanning electron microscopes that can be operated
[...] reservation are important. A specimen could be sent in
advance to the facility hosting the microscope. Then the user can remotely operate the
machine, changing perspective views until the desired image is captured.

[...] potentially to remote medical diagnostic and
[...] by one's imagination. Today, we have remote device drivers for printers. Eventually, we will
see standards for grid-enabled device drivers to many unusual devices and resources. All of
these will make the grid look like a large virtual machine with a collection of virtual resources
beyond what would be available on just one conventional machine.



Resource balancing 

A grid federates a large number of resources contributed by individual machines into a 
greater total virtual resource. For applications that are grid enabled, the grid can offer a 
resource balancing effect by scheduling grid jobs on machines with low utilization, as 
illustrated in Figure 2. This feature can prove invaluable for handling occasional peak loads of 
activity in parts of a larger organization. This can happen in two ways:

► An unexpected peak can be routed to relatively idle machines in the grid. 

► If the grid is already fully utilized, the lowest priority work being performed on the grid can 
be temporarily suspended or even cancelled and performed again later to make room for 
the higher priority work. 

Without a grid infrastructure, such balancing decisions are difficult to prioritize and execute. 
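
The two balancing behaviors listed above can be sketched as a single placement routine. This is an illustrative sketch under stated assumptions: the `machines` dictionary layout, the slot-based notion of utilization, and the convention that larger numbers mean higher priority are all invented for the example and do not describe any particular grid scheduler.

```python
def place_job(machines, job_id, priority):
    """Place a job and return (machine_name, suspended_job_id).

    machines maps a name to {"slots": int, "running": [(priority, job_id), ...]}.
    Returns (None, None) when the job cannot be placed.
    """
    # Case 1: route the job to the least-utilized machine with a free slot.
    free = [(len(m["running"]) / m["slots"], name)
            for name, m in machines.items() if len(m["running"]) < m["slots"]]
    if free:
        _, name = min(free)
        machines[name]["running"].append((priority, job_id))
        return name, None
    # Case 2: the grid is full, so look for the lowest-priority running job.
    candidates = [(p, j, name)
                  for name, m in machines.items() for p, j in m["running"]]
    if not candidates:
        return None, None
    victim_priority, victim_id, name = min(candidates)
    if victim_priority >= priority:
        return None, None  # nothing of lower priority to displace
    # Suspend the victim (to be performed again later) and take its slot.
    machines[name]["running"].remove((victim_priority, victim_id))
    machines[name]["running"].append((priority, job_id))
    return name, victim_id
```

The caller is expected to requeue the suspended job so that it can run again once capacity frees up.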

Occasionally, a project may suddenly rise in importance with a specific deadline. A grid 
cannot perform a miracle and achieve a deadline when it is already too close. However, if the 
size of the job is known, if it is a kind of job that can be sufficiently split into subjobs, and if 
enough resources are available after preempting lower priority work, a grid can bring a very 
large amount of processing power to solve the problem. In such situations, a grid can, with 
some planning, succeed in meeting a surprise deadline. 





Figure 2 Jobs are migrated to less busy parts of the grid to balance resource loads and absorb
unexpected peaks of activity in a part of an organization

Other more subtle benefits can occur using a grid for load balancing. When jobs
[...] or storage resources, an advanced scheduler
could schedule them to minimize communications traffic or minimize the distance of the
communications. This can [...]

[...] excellent infrastructure for brokering resources. Individual resources
[...] be factored into [...] Different organizations participating in the grid can build up grid
credits and use them at times when they need additional resources. This can form the basis
for grid accounting and the ability to more fairly distribute work on the grid.

Reliability

[...] expensive hardware to increase reliability [...] contain much logic
to achieve graceful recovery from an assortment of hardware failures. The machines also use
[...] duplicated. The systems are [...] generators if utility power is interrupted. All [...]

[...] an alternate approach to reliability that relies more on software
[...] than expensive hardware. A grid is just the beginning of such technology. The
systems in a grid can be relatively inexpensive and geographically dispersed. Thus, [...]
parts of the grid are not likely to be [...] automatically resubmit jobs to other machines on
[...] situations, multiple copies of the
important jobs can be run on different machines throughout the grid, as illustrated in Figure 3.
[...] inconsistency, such as computer
failures, data corruption, or tampering.
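
When several copies of a job run on different machines, their outputs can be compared to expose a failed or tampered copy. The sketch below assumes simple majority voting over hashable results; that voting rule and the function name are assumptions for illustration, since the text does not prescribe how the comparison is done.

```python
from collections import Counter

def check_redundant_results(results):
    """Compare the results of redundant copies of one job.

    results maps a machine name to the (hashable) result its copy produced
    and must be non-empty. Returns (agreed_result, suspect_machines);
    agreed_result is None when no strict majority of copies agree.
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        # No strict majority: every copy is suspect.
        return None, sorted(results)
    # Machines that disagree with the majority may have failed,
    # corrupted their data, or been tampered with.
    suspects = sorted(m for m, r in results.items() if r != value)
    return value, suspects
```

With three copies, one faulty machine is outvoted; with only two copies a disagreement can be detected but not resolved.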






Figure 3 Redundant grid configuration and redundant job submission used to achieve high reliability

Such grid systems will utilize "autonomic computing." This is a type of software that
automatically heals problems in the grid, perhaps even before an operator or manager is
aware of them. In principle, most of the reliability attributes achieved using hardware in
today's high availability systems can be achieved using software in a grid setting in the future.

Management 

The goal to virtualize the resources on the grid and more uniformly handle heterogeneous 
systems will create new opportunities to better manage a larger, more dispersed IT
infrastructure. It will be easier to visualize capacity and utilization, making it easier for IT 
departments to control expenditures for computing resources over a larger organization. 

The grid offers management of priorities among different projects. In the past, each project 
may have been responsible for its own IT resource hardware and the expenses associated 
with it. Often this hardware might be underutilized while another project finds itself in trouble, 
needing more resources due to unexpected events. With the larger view a grid can offer, it 
becomes easier to control and manage such situations. As illustrated in Figure 4 on page 8,
administrators can change any number of policies that affect how the different organizations 
might share or compete for resources. 

Aggregating utilization data over a larger set of projects can enhance an organization's ability 
to project future upgrade needs. When maintenance is required, grid work can be rerouted to 
other machines without crippling the projects involved. 

Autonomic computing can come into play here too. Various tools may be able to identify 
important trends throughout the grid, informing management of those that require attention.

Figure 4 Administrators can adjust policies to better allocate resources



Grid concepts and components 

In this section, we introduce the various grid concepts, components, and terms in more detail.

Types of resources 

A grid is a collection of machines, sometimes referred to as "nodes," "resources," "members,"
"donors," "clients," "hosts," "engines," and many other such terms. They all contribute any
combination of resources to the grid as a whole. Some resources may be used by all users of 
the grid while others may have specific restrictions. 

Computation 

The most common resource is computing cycles provided by the processors of the machines
on the grid. The processors can vary in speed, architecture, software platform, and other
associated factors, such as memory, storage, and connectivity. There are three primary ways
to exploit the computation resources of a grid. The first and simplest is to use it to run an
existing application on an available machine on the grid rather than locally. The second is to
use an application designed to split its work in such a way that the separate parts can execute
in parallel on different processors. The third is to run an application that needs to be executed
many times on many different machines in the grid. "Scalability" is a measure of how
efficiently the multiple processors on a grid are used. If twice as many processors makes an
application complete in one half the time, then it is said to be perfectly scalable. However,
there may be limits to scalability when applications can only be split into a limited number of
separately running parts or if those parts experience some other contention for resources of
some kind.
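
The notion of scalability above can be made concrete with a small calculation. The sketch below is an illustration under simplifying assumptions: the function names are invented, and `ideal_time` models only the first limit mentioned (a cap on the number of independently running parts), ignoring contention.

```python
def speedup(time_one, time_n):
    """How many times faster the run on n processors is than on one."""
    return time_one / time_n

def efficiency(time_one, time_n, n):
    """1.0 means perfectly scalable: n processors give an n-fold speedup."""
    return speedup(time_one, time_n) / n

def ideal_time(work, n, max_parts):
    """Run time when the work splits into at most max_parts independent parts.

    Processors beyond max_parts have nothing to do, so they add no speed.
    """
    return work / min(n, max_parts)
```

For instance, a job that can only be split four ways gains nothing from the fifth processor onward, so its efficiency on ten processors is well below 1.0.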



Storage 

The second most common resource used in a grid is data storage. A grid providing an 
integrated view of data storage is sometimes called a "data grid." Each machine on the grid 
usually provides some quantity of storage for grid use, even if temporary. Storage can be
memory attached to the processor or it can be "secondary storage" using hard disk drives or 
other permanent storage media. Memory attached to a processor usually has very fast 
access but is volatile. It would best be used to cache data to serve as temporary storage for 
running applications. 

Secondary storage in a grid can be used in interesting ways to increase capacity,
performance, sharing, and reliability of data. Many grid systems use mountable networked file
systems, such as Andrew File System (AFS), Network File System (NFS), Distributed File
System (DFS), or General Parallel File System (GPFS). These offer varying degrees of 
performance, security features, and reliability features. 

Capacity can be increased by using the storage on multiple machines with a unifying file 
system. Any individual file or data base can span several storage devices and machines,
eliminating maximum size restrictions often imposed by file systems shipped with operating
systems. A unifying file system can also provide a single uniform name space for grid
storage. This makes it easier for users to reference data residing in the grid, without regard
for its exact location. In a similar way, special data base software can "federate" an
assortment of individual data bases and files to form a larger, more comprehensive data 
base, accessible using data base query functions. 




[Figure 5: diagram of a striped virtual file system. Labels: Virtualization, Capacity, Sharing, Availability; Striping - speed; Mirrors - reliability; Replicas - remote; Journals - transactions.]

Figure 5 Data striping is writing or reading successive records to/from different physical devices,
overlapping the access for faster throughput; additional techniques increase reliability

More advanced file systems on a grid can automatically duplicate sets of data, to provide 
redundancy for increased reliability and increased performance. An intelligent grid scheduler 
can help select the appropriate storage devices to hold data, based on usage patterns. Jobs 
can then be scheduled closer to the data, preferably on the machines directly connected to
the storage devices holding the required data. 

Data striping can be implemented by grid file systems, as illustrated in Figure 5
[...] or predictable access patterns to data, this technique can
create the virtual effect of having storage devices that can transfer data at a faster rate than
any individual disk drive. This can be important for multimedia data streams or [...]
CAT scans or particle physics experiments, for example.
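
Striping as described in the Figure 5 caption, writing successive records to different physical devices, can be sketched with plain lists standing in for devices. The round-robin layout and the function names are assumptions for illustration; real grid file systems stripe at the block level and perform the device accesses in parallel.

```python
def stripe(records, n_devices):
    """Distribute successive records round-robin across n_devices."""
    devices = [[] for _ in range(n_devices)]
    for i, record in enumerate(records):
        devices[i % n_devices].append(record)
    return devices

def unstripe(devices):
    """Reassemble the original record order from the striped devices."""
    n = len(devices)
    total = sum(len(d) for d in devices)
    # Record i lives on device i % n at position i // n.
    return [devices[i % n][i // n] for i in range(total)]
```

Because consecutive records sit on different devices, a sequential read can fetch from all devices at once, which is where the higher aggregate transfer rate comes from.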

A grid file system can also implement journaling so that data can be recovered more reliably 
after certain kinds of failures. In addition, some file systems implement advanced 
synchronization mechanisms to reduce contention when data is shared and updated by many 
users. 



Communications 

The rapid growth in communication capacity among machines today makes grid computing 
practical, compared to the limited bandwidth available when distributed computing was first 
emerging. Data communication capacity is therefore another important resource of a grid. 
It includes communications within the grid and external to 
the grid. Communications within the grid are important for sending jobs and their required 
data to points within the grid. Some jobs require a large amount of data to be processed, and 
it may not always reside on the machine running the job. The bandwidth available for such 
communications can often be a critical resource that can limit utilization of the grid. 

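External communication access to the Internet, for example, can be valuable when building 
search engines. Machines on the grid may have connections to the external Internet in 
addition to the connectivity among the grid machines. When these connections do not share 
the same communication path, they add to the total available bandwidth for accessing 
the Internet. 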

Redundant communication paths are sometimes needed to better handle potential network 
failures and excessive data traffic. In some cases, higher speed networks must be provided to 
meet the demands of jobs transferring larger amounts of data. A grid management system 
can better show the topology of the grid and highlight the communication bottlenecks. This 
information can in turn be used to plan for hardware upgrades. 

Software and licenses 

The grid may have software installed that may be too expensive to install on every grid 
machine. Using a grid, the jobs requiring this software are sent to the particular machines on 
which this software happens to be installed. When the licensing fees are significant, this 
approach can save significant expenses for an organization. 

Some software licensing arrangements permit the software to be installed on all of the 
machines of a grid but may limit the number of installations that can be simultaneously used 
at any given instant. License management software keeps track of how many concurrent 
copies of the software are being used and prevents more than that number from executing at 
any given time. The grid job schedulers can be configured to take software licenses into 
account, optionally balancing them against other priorities or policies. 
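
The concurrent-use limit can be sketched as a small counter that a scheduler consults 
before starting a job. The class and method names below are hypothetical and are not taken 
from any particular license manager. 

```python
class LicensePool:
    """Tracks concurrent use of one software license with a fixed seat limit."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_use = set()  # IDs of jobs currently holding a seat

    def acquire(self, job_id):
        """Return True if job_id holds a seat afterwards, False if none is free."""
        if job_id in self.in_use:
            return True
        if len(self.in_use) >= self.max_concurrent:
            return False
        self.in_use.add(job_id)
        return True

    def release(self, job_id):
        """Free the seat held by job_id, if it holds one."""
        self.in_use.discard(job_id)
```

A scheduler that takes licenses into account would call `acquire` before dispatching a job 
and leave the job queued when it returns False. 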

Special equipment, capacities, architectures, and policies 

Platforms on the grid will often have different architectures, operating systems, devices, 
capacities, and equipment. Each of these items represents a different kind of resource that 
the grid can use as criteria for assigning jobs to machines. While some software may be 
available on several architectures, for example, PowerPC and x86, such software is often 
designed to run only on a particular type of hardware and operating system. Such attributes 
must be considered when assigning jobs to resources in the grid. 

In some cases, the administrator of a grid may create a new artificial resource type that is 
used by schedulers to assign work according to policy rules or other constraints. For 
example, some machines may be designated to only be used for medical research. These 
would be identified as having a medical research attribute and the scheduler could be 
configured to only assign jobs that require machines of the medical research "resource." 
Others may participate in the grid only if they are not used for military purposes. In this 
situation, jobs requiring a "military resource" would not be assigned to such machines. Of 
course, the administrators would need to impose a classification on each kind of job through 
some certification procedure to use this kind of approach. 
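
One way to picture attribute matching, including artificial resource types, is with sets. 
This is a minimal sketch under assumed names: `Machine`, `exclusive`, and `can_run` are 
illustrative only, and the attribute strings are made up. 

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    attributes: set                               # resources the machine offers
    exclusive: set = field(default_factory=set)   # tags a job must require to run here


def can_run(required, machine):
    """A job fits if the machine offers everything it requires and the job
    requires every tag the machine is exclusively reserved for."""
    return required <= machine.attributes and machine.exclusive <= required
```

With this model, a machine reserved for medical research rejects ordinary jobs, and a job 
that requires a "military" resource is never placed on a machine that does not offer it. 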

Jobs and applications 

Although various kinds of resources on the grid may be shared and used, they are usually 
accessed via an executing "application" or "job." Usually we use the term "application" as the 
highest level of a piece of work on the grid. However, sometimes the term "job" is used 
equivalently. Applications may be broken down into any number of individual jobs, as 
illustrated in Figure 6. Those, in turn, can be further broken down into "subjobs." The grid 
industry uses other terms, such as transaction, work unit, or submission, to mean the same 
thing as a job. 

Jobs are programs that are executed at an appropriate point on the grid. They may compute 
something, execute one or more system commands, move or collect data, or operate 
machinery. A grid application that is organized as a collection of jobs is usually designed to 
have these jobs execute in parallel on different machines in the grid. 



[Figure 6 labels: Jobs and subjobs to run; Collecting results] 

Figure 6 An application is one or more jobs that are scheduled to run on machines in the grid; the 
results are collected and assembled to produce the answer 

The jobs may have specific dependencies that may prevent them from executing in parallel in 
all cases. For example, they may require some specific input data that must be copied to the 
machine on which the job is to run. Some jobs may require the output produced by certain 
other jobs and cannot be executed until those prerequisite jobs have completed executing. 
Jobs may spawn additional subjobs, depending on the data they process. This work flow can 
create a hierarchy of jobs and subjobs. Finally, the results of all of the jobs must be collected 
and appropriately assembled to produce the ultimate answer for the application. 
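
The effect of such dependencies on parallelism can be sketched with a topological sort. 
The example assumes Python 3.9 or later for the standard `graphlib` module; the function 
name `execution_waves` and the job names used with it are hypothetical. 

```python
from graphlib import TopologicalSorter


def execution_waves(prerequisites):
    """Group jobs into waves. Every job in a wave has all of its prerequisite
    jobs in earlier waves, so the jobs within one wave could run in parallel.

    prerequisites maps each job name to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(prerequisites)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For a small application where two jobs need staged input and a final job assembles their 
output, the waves come out as staging first, then both jobs together, then assembly. 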

Scheduling, reservation, and scavenging 

The grid system is responsible for sending a job to a given machine to be executed. In the 
simplest of grid systems, the user may select a machine suitable for running his job and then 
execute a grid command that sends the job to the selected machine. More advanced grid 
systems would include a job "scheduler" of some kind that automatically finds the most 
appropriate machine on which to run any given job that is waiting to be executed. Schedulers 
react to current availability of resources on the grid. The term "scheduling" is not to be 
confused with "reservation" of resources in advance to improve the quality of service. 
Sometimes the term "resource broker" is used in place of "scheduler," but this term implies 
that some sort of bartering capability is factored into scheduling. 

In a "scavenging" grid system, any machine that becomes idle would typically report its idle 
status to the grid management node. This management node would assign to this idle 
machine the next job that is satisfied by the machine's resources. Scavenging is usually 
implemented in a way that is unobtrusive to the normal machine user. If the machine 
becomes busy with local non-grid work, the grid job is usually suspended or delayed. This 
situation creates somewhat unpredictable completion times for grid jobs, although it is not 
disruptive to those machines donating resources to the grid. 
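
The idle-report step can be sketched as follows. This is an illustrative model only: the 
class `ScavengingManager` and its methods are assumptions, and suspension of jobs when the 
machine becomes busy again is left out. 

```python
from collections import deque


class ScavengingManager:
    """Management node that hands waiting jobs to machines reporting idle."""

    def __init__(self):
        self.pending = deque()  # (job_name, required_resources) in arrival order

    def submit(self, job_name, required):
        self.pending.append((job_name, set(required)))

    def report_idle(self, machine_resources):
        """Return the first waiting job the idle machine can satisfy, or None."""
        offered = set(machine_resources)
        for index, (job_name, required) in enumerate(self.pending):
            if required <= offered:
                del self.pending[index]
                return job_name
        return None
```

Jobs that the idle machine cannot satisfy stay queued in their original order for the next 
machine that reports in. 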

To create more predictable behavior, grid machines are often "dedicated" to the grid and are 
not preempted by outside work. This enables schedulers to compute the approximate 
completion time for a set of jobs, when their running characteristics are known. 

As a further step, grid resources can be "reserved" in advance for a designated set of jobs. 
Such reservations operate much like a calendaring system used to reserve conference rooms 
for meetings. This is done to meet deadlines and guarantee quality of service. When policies 
permit, resources reserved in advance could also be scavenged to run lower priority jobs 
when they are not busy during a reservation period, yielding to jobs for which they are 
reserved. Thus, various combinations of scheduling, reservation, and scavenging can be 
used to more completely utilize the grid. 

Scheduling and reservation is fairly straightforward when only one resource type, usually 
CPU, is involved. However, additional grid optimizations can be achieved by considering 
more resources in the scheduling and reservation process. For example, it would be 
desirable to assign executing jobs to machines nearest to the data that these jobs require. 
This would reduce network traffic and possibly reduce scalability limits. Optimal scheduling, 
considering multiple resources, is a difficult mathematics problem. Therefore, such 
schedulers may use heuristics. These heuristics are rules that are designed to improve the 
probability of finding the best combination of job schedules and reservations to optimize 
throughput or any other metric. 
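
One simple heuristic of this kind is greedy placement: prefer the machine that already 
holds a job's data, and fall back to the least loaded machine. The sketch below is an 
assumed example, not a method described by any specific grid product, and it makes no claim 
to find the optimal schedule. 

```python
def assign_jobs(jobs, data_location, capacity):
    """Greedy data-locality heuristic.

    jobs: list of (job_name, dataset) pairs, in submission order
    data_location: maps dataset -> machine that stores it
    capacity: maps machine -> number of job slots (positive integers)
    """
    load = {machine: 0 for machine in capacity}
    placement = {}
    for job_name, dataset in jobs:
        preferred = data_location.get(dataset)
        if preferred in load and load[preferred] < capacity[preferred]:
            target = preferred  # run next to the data
        else:
            candidates = [m for m in capacity if load[m] < capacity[m]]
            if not candidates:
                raise RuntimeError("no free slot for job " + job_name)
            # Fall back to the machine with the lowest fraction of slots in use.
            target = min(candidates, key=lambda m: load[m] / capacity[m])
        placement[job_name] = target
        load[target] += 1
    return placement
```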



Intragrid to Intergrid 



There have been attempts to formulate a precise definition for what a "grid" is. In fact, the 
concept of grid computing is still evolving and most attempts to define it precisely end up 
excluding implementations that many would consider to be a grid. We will be pragmatic and 
make no claim to any definitive description of what a grid is and is not. Therefore, the 
following descriptions of various kinds of "grids" must be taken loosely. 

Grids can be built in all sizes, ranging from just a few machines in a department to groups of 
machines organized as a hierarchy spanning the world. In this section, we will describe some 
examples in this range of grid system topologies. 




Figure 7 A simple grid 



As presented in Figure 7, the simplest grid consists of just a few machines, all of the same 
hardware architecture and same operating system, connected on a local network. This kind of 
grid uses homogeneous systems so there are fewer considerations and may be used just for 
experimenting with grid software. The machines are usually in one department of an 
organization, and their use as a grid may not require any special policies or security 
concerns. Because the machines have the same architecture and operating system, 
choosing application software for these machines is usually simple. Some people would call 
this a "cluster" implementation rather than a "grid." 

The next progression would be to include heterogeneous machines. In this configuration, 
more types of resources are available. The grid system is likely to include some scheduling 
components. File sharing may still be accomplished using networked file systems. Machines 
participating in the grid may include ones from multiple departments but within the same 
organization. Such a grid is also referred to as an "Intragrid." 

As the grid expands to many departments, policies may be required for how the grid should 
be used. For example, there may be policies for what kinds of work are allowed on the grid and 
at what times. There may be a prioritization by department or by kinds of applications that 
should have access to grid resources. Also, security becomes more important as more 
organizations are involved. Sensitive data in one department may need to be protected from 
access by jobs running for other departments. Dedicated grid machines may be added to 
increase the quality of service for grid computing, rather than depending entirely on 
scavenged resources. 

The grid may grow geographically in an organization that has facilities in different cities. 
Dedicated communications connections may be used among these facilities and the grid. In 
some cases, VPN tunneling or other technologies may be used over the Internet to connect 
the different parts of the organization. Security increases in importance once the bounds of 
any given facility are traversed. The grid may grow to be hierarchically organized to reduce 
the contention implied by central control, increasing scalability. 




Figure 8 A more complex Intergrid 

Over time, as illustrated in Figure 8, a grid may grow to cross organization boundaries, and 
may be used to collaborate on projects of common interest. This is known as an "Intergrid." 
The highest levels of security are usually required in this configuration to prevent possible 
attacks and spying. The Intergrid offers the prospect for trading or brokering resources over a 
much wider audience. Resources may be purchased as a utility from trusted suppliers. 



Grid construction 

An ad hoc grid may be installed by a few programmers in their spare time, but as the grid 
grows, and as users become more dependent on it for mission-critical work, a degree of 
planning is essential. It is best to understand the organization's requirements and choose grid 
technologies that best fit these requirements. This section discusses some of the planning 
considerations and grid components that address the requirements. 

Deployment planning 

The use of a grid is often born from a need for increased resources of some type. One often 
looks to their neighbor who may have excess capacity in the particular resource. One of the 
first considerations is the hardware available and how it is connected via a LAN or WAN. Next, 
an organization may want to add additional hardware to augment the capabilities of the grid. It 
is important to understand the applications to be used on the grid. Their characteristics can 
affect the decisions of how to best choose and configure the hardware and its data 
connectivity. 

Security 

Security is a much more important factor in planning and maintaining a grid than in 
conventional distributed computing, where data sharing comprises the bulk of the activity. In a 
grid, the member machines are configured to execute programs rather than just move data. 
This makes an unsecured grid potentially fertile ground for viruses and Trojan horse 
programs. For this reason, it is important to understand exactly which components of the grid 
must be rigorously secured to deter any kind of attack. Furthermore, it is important to 
understand the issues involved in authenticating users and properly executing the 
responsibilities of a certificate authority. 

Organization 

The technology considerations are important in deploying a grid. However, organizational and 
business issues can be equally important. It is important to understand how the departments 
in an organization interact, operate, and contribute to the whole. Often, there are barriers built 
between departments and projects to protect their resources in an effort to increase the 
probability of timely success. However, by rethinking some of these relationships, one can 
find that more sharing of resources can sometimes benefit the entire organization better. For 
example, a project that finds itself behind schedule and over budget may not be able to afford 
the resources required to solve the problem. A grid would give such projects an added 
measure of safety, providing an extra margin of resource capacity needed to finish the project. 
Similarly, a project in its early stages, when computing resources are not being fully utilized, 
may be able to donate them to other projects in need. A grid also offers the ability for the 
organization's management to see the bigger priority picture and react more quickly in shifting 
resource utilization, priorities, and policies. 

Grid software components 

This section presents some of the key components that must be discussed before designing 
a grid computing architecture. 

Management components 

Any grid system has some management components. First, there is a component that keeps 
track of the resources available to the grid and which users are members of the grid. This 
information is used primarily to decide where grid jobs should be assigned. 

Second, there are measurement components that determine both the capacities of the nodes 
on the grid and their current utilization rate at any given time. This information is used to 
schedule jobs in the grid. Such information is also used to determine the health of the grid, 
alerting personnel to problems such as outages, congestion, or overcommitment. This 
information is also used to determine overall usage patterns and statistics, as well as to log 
and account for usage of grid resources. 

Third, advanced grid management software can automatically manage many aspects of the 
grid. This is known as "autonomic computing," or "recovery oriented computing." This 
software would automatically recover from various kinds of grid failures and outages, finding 
alternative ways to get the workload processed. 

Donor software 

Each machine contributing resources typically needs to enroll as a member of the grid and 
install some software that manages the grid's use of its resources. Usually, some sort of 
identification and authentication procedure must be performed before a machine can join the 
grid. A certificate authority can be used to establish the identity of the donor machine as well 
as the users and the grid itself. 

Some grid systems provide their own login to the grid while others depend on the native 
operating systems for user authentication. In the latter case, a user ID mapping system may 
be needed to match the user's rights properly on different machines. This typically is manually 
maintained by a grid administrator. He determines which user ID a given user may possess 
on each grid machine and enters these IDs in a protected data base or registry. In this way, 
when grid jobs are submitted to different machines for a user, the proper local machine user 
ID is used for determining the user's rights. 
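
Such a mapping can be pictured as a lookup keyed by grid user and machine. The sketch 
below uses an in-memory dictionary and hypothetical names; a real registry would be a 
protected data base maintained by the grid administrator. 

```python
class UserIdRegistry:
    """Maps a grid user to the local user ID to use on each machine."""

    def __init__(self):
        self._local_ids = {}  # (grid_user, machine) -> local user ID

    def register(self, grid_user, machine, local_id):
        """Record the local ID the administrator assigned on this machine."""
        self._local_ids[(grid_user, machine)] = local_id

    def local_id(self, grid_user, machine):
        """Return the local ID for a job, or refuse if none was registered."""
        try:
            return self._local_ids[(grid_user, machine)]
        except KeyError:
            raise PermissionError(
                f"{grid_user} has no user ID on {machine}"
            ) from None
```

Refusing unknown pairs, rather than falling back to a default account, keeps a job from 
running with rights the administrator never granted. 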

In some grid systems, it is possible to join the grid without any special authentication. And in 
others, it is possible for any user to submit jobs to the grid. Such systems may be convenient 
to set up, but should be discouraged in larger deployments due to the serious security 
problems that they would open up. 

The grid system makes information about the newly added resources available throughout 
the grid. The donor machine will usually have some sort of monitor that determines or 
measures how busy the machine is and the rate or amount of resources utilized. This 
information is "bubbled up" to the management software of the grid and used to schedule use 
of those resources accordingly. In a scavenging system, this information tells the grid 
management software when the machine is idle and available for work. 

Most importantly, the software installed on a given machine can accept an executable job 
from the grid management system and execute it. A user somewhere on the grid submits a 
job for execution on the grid. The grid management software must communicate with the grid 
donor software to send the job there. The donor grid software must be able to receive the 
executable file or select the proper one from copies pre-installed on the donor machine. The 
software is executed and the output is sent back to the requester. More advanced 
implementations can dynamically adjust the priority of a running job, suspend it and resume it 
later, or checkpoint it with the possibility of resuming its execution on a different machine. 
These kinds of actions may be necessary to respond to load balancing problems or priority or 
policy changes in the grid. 

Submission software 

Usually any member machine of a grid can be used to submit jobs to the grid and initiate grid 
queries. However, in some grid systems, this function is implemented as a separate 
component installed on "submission nodes" or "submission clients." When a grid is built using 
dedicated resources rather than scavenged resources, separate submission software is 
usually installed on the user's desktop or workstation. 

Distributed grid management 

Larger grids may have a hierarchical or other type of organizational topology, usually 
matching the connectivity topology. That is, machines locally connected together with a LAN 
form a "cluster" of machines. The grid may be organized in a hierarchy consisting of clusters 
of clusters. The work involved in managing the grid is distributed to increase the scalability of 
the grid. The collection of grid operation and resource data as well as job scheduling is 
distributed to match the topology of the grid. For example, a central job scheduler will not 
schedule a submitted job directly to the machine which is to execute it. Instead, the job is sent 
to a lower level scheduler which handles a set of machines (or further clusters). The lower 
level scheduler handles the assignment to the specific machine. Similarly, the collection of 
statistical information is distributed. Lower level clusters receive activity information from the 
individual machines, aggregate it, and send it to higher level management nodes in the 
hierarchy. 
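
The upward aggregation of activity information can be sketched as a recursive sum over 
the hierarchy. The dictionary layout used here ("busy", "cpus", "members") is an assumption 
made for the example. 

```python
def aggregate(node):
    """Total the busy and available CPU counts for a machine or a cluster.

    A machine is a dict with 'busy' and 'cpus' counts.
    A cluster is a dict with a 'members' list of machines or further clusters.
    """
    if "members" not in node:
        return {"busy": node["busy"], "cpus": node["cpus"]}
    summary = {"busy": 0, "cpus": 0}
    for member in node["members"]:
        child = aggregate(member)  # each level summarizes its own members
        summary["busy"] += child["busy"]
        summary["cpus"] += child["cpus"]
    return summary
```

Each level passes only its summary upward, so a higher level management node never needs 
to contact individual machines directly. 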

Schedulers 

Most grid systems include some sort of job scheduling software. This software locates a 
machine on which to run a grid job that has been submitted by a user. In the simplest cases, 
it may just blindly assign jobs in a round-robin fashion to the next machine matching the 
resource requirements. However, there are advantages to using a more advanced scheduler. 

Some schedulers implement a job priority system. This is sometimes done by using several 
job queues, each with a different priority. As grid machines become available to execute jobs, 
the jobs are taken from the highest priority queues first. Policies of various kinds are also 
implemented using schedulers. Policies can include various kinds of constraints on jobs, 
users, and resources. For example, there may be a policy that restricts grid jobs from 
executing at certain times of the day. 
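
The several-queues approach to priorities can be sketched in a few lines. The class name 
and the convention that level 0 is the highest priority are assumptions for this example. 

```python
from collections import deque


class PriorityScheduler:
    """Keeps one first-in, first-out queue per priority level."""

    def __init__(self, levels):
        # Index 0 is the highest priority queue.
        self.queues = [deque() for _ in range(levels)]

    def submit(self, job, priority):
        self.queues[priority].append(job)

    def next_job(self):
        """Return the oldest job from the highest priority non-empty queue."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None
```

A policy such as a time-of-day restriction could be added as an extra check in `next_job` 
before a job is handed out. 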

Schedulers usually react to the immediate grid load. They use measurement information 
about the current utilization of machines to determine which ones are not busy before 
submitting a job. Schedulers can be organized in a hierarchy. For example, a meta-scheduler 
may submit a job to a cluster scheduler or other lower level scheduler rather than to an 
individual machine. 

More advanced schedulers will monitor the progress of scheduled jobs, managing the overall 
work-flow. If jobs are lost due to system or network outages, a good scheduler will 
automatically resubmit them elsewhere. However, if a job appears to be in an infinite loop 
and reaches a maximum timeout, then such a job should not be rescheduled. Typically, jobs 
have different kinds of completion codes, some of which are suitable for resubmission and 
some of which are not. 
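
The resubmission decision can be sketched as a check on the completion code. The code 
names below are invented for illustration; actual completion codes differ from one grid 
system to another. 

```python
# Hypothetical completion codes that indicate a grid failure rather than
# a problem with the job itself.
RETRYABLE_CODES = {"NODE_LOST", "NETWORK_OUTAGE"}


def should_resubmit(completion_code, attempts, max_attempts=3):
    """Resubmit only for grid-side failures, and only a limited number of times."""
    return completion_code in RETRYABLE_CODES and attempts < max_attempts
```

A job that ended with a timeout or a program fault is deliberately not in the retryable 
set, since running it again elsewhere would most likely fail the same way. 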

Reserving resources on the grid in advance is accomplished with a "reservation system." It is 
more than a scheduler. It is first a calendar-based system for reserving resources for specific 
time periods and preventing any others from reserving the same resource at the same time. It 
also must be able to remove or suspend jobs that may be running on any machine or 
resource when the reservation period is reached. 
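
The calendar part of such a system can be sketched as an overlap check on time intervals. 
The class below is a hypothetical example that uses plain numbers for times and does not 
cover removing or suspending running jobs. 

```python
class ReservationCalendar:
    """Records reservations as half-open intervals [start, end) per resource."""

    def __init__(self):
        self._booked = {}  # resource -> list of (start, end, owner)

    def reserve(self, resource, start, end, owner):
        """Book the resource if the interval is free; return True on success."""
        if start >= end:
            raise ValueError("start must be before end")
        slots = self._booked.setdefault(resource, [])
        for booked_start, booked_end, _ in slots:
            # Two half-open intervals overlap when each starts before the other ends.
            if start < booked_end and booked_start < end:
                return False
        slots.append((start, end, owner))
        return True

    def owner_at(self, resource, time):
        """Return who holds the resource at the given time, or None."""
        for start, end, owner in self._booked.get(resource, []):
            if start <= time < end:
                return owner
        return None
```

A scheduler could call `owner_at` when a reservation period begins to decide which running 
jobs must yield. 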

Communications 

A grid system may include software to help jobs communicate with each other. For example, 
an application may split itself into a large number of subjobs. Each of these subjobs is a 
separate job in the grid. However, the application may implement an algorithm that requires 
that the subjobs communicate some information among them. The subjobs need to be able to 
locate other specific subjobs, establish a communications connection with them, and send 
the appropriate data. The open standard Message Passing Interface (MPI) and any of several 
variations is often included as part of the grid system for just this kind of communication. 

Observation, management, and measurement 

We mentioned above that schedulers react to current loads on the grid. Usually, the donor 
software will include some tools that measure the current load and activity on a given 
machine, either using operating system facilities or by direct measurement. This software is 
sometimes referred to as a "load sensor." Some grid systems provide the means for 
implementing custom load sensors for other than CPU or storage resources. 

Such measurement information is useful not only for scheduling, but also for discovering 
overall usage patterns in the grid. The statistics can show trends which may signal the need 
for additional hardware. Also, measurement information about specific jobs can be collected 
and used to better predict the resource requirements of that job the next time it is run. The 
better the prediction, the more efficiently the grid's workload can be managed. 

The measurement information can also be saved for accounting purposes, to form the basis 
for grid resource brokering, or to manage priorities more fairly. The information can also be 
displayed in various forms to better visualize grid activity and utilization. 
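
A minimal load sensor built on operating system facilities might look like the sketch 
below. It relies on `os.getloadavg`, which is only available on Unix-like systems, so the 
function returns None elsewhere. The function names and the idle threshold are assumptions. 

```python
import os


def cpu_load_sensor():
    """Return the 1-minute load average per CPU, or None if it is unavailable."""
    try:
        one_minute, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None  # platform does not provide load averages
    return one_minute / (os.cpu_count() or 1)


def is_idle(load, threshold=0.1):
    """Treat the machine as idle when the measured load is below the threshold."""
    return load is not None and load < threshold
```

A donor monitor could call `cpu_load_sensor` periodically and report the value to the 
management node, which is also how a scavenging system would learn that a machine is idle. 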



Using a grid: A user's perspective 

This section describes the typical activities involved in using the grid from a user's perspective. 

Enrolling and installing grid software 

A user first enrolls as a grid user, and installs the provided grid software on his own machine. 
He may optionally enroll his machine as a donor on the grid. 

Enrolling in the grid may require authentication for security purposes. The user positively 
establishes his identity with a certificate authority. This should not be done solely via the 
Internet. The certificate authority must take steps to assure that the user is in fact who he 
claims to be. The certificate authority makes a special certificate available to software 
needing to check the true identity of a grid user and his grid requests. Similar steps may be 
required to identify the donating machine. The user has the responsibility of keeping his grid 
credentials secure. 

Once the user and/or machine are authenticated, the grid software is provided to the user for 
installing on his machine for the purposes of using the grid as well as donating to the grid. 
This software may be automatically preconfigured by the grid management system to know 
the communication address of the management nodes in the grid and user or machine 
identification information. In this way, the installation may be a one-click operation with a 
minimum of interaction required on the part of the user. In less automated grid installations, 
the user may be asked to identify the grid's management node and possibly other 
configuration information. He may choose to limit the resources donated to the grid, the times 
that his machine is usable by the grid, and other policy related constraints. The user may also 
need to inform the grid administrator which user IDs are his on other machines that exist on 
the grid. 



Logging onto the grid 

To use the grid, most grid systems require the user to log on to a system using a user ID that 
is enrolled in the grid. Other grid systems may have their own grid login ID separate from the 
one on the operating system. A grid login is usually more convenient for grid users. It 
eliminates the ID matching problems among different machines. To the user, it makes the grid 
look more like one large virtual computer rather than a collection of individual machines. 
Globus, for example, implements a proxy login model that keeps the user logged in for a 
specified amount of time, even if he logs off and back on the operating system and even if the 
machine is rebooted. 

Once logged on, the user can query the grid and submit jobs. Some grid implementations 
permit some query functions if the user is not logged into the grid or even if the user is not 
enrolled in the grid. 

Queries and submitting jobs 

The user will usually perform some queries to check to see how busy the grid is, to see how 
his submitted jobs are progressing, and to look for resources on the grid. Grid systems 
usually provide command line tools as well as graphical user interfaces (GUIs) for queries. 
Command line tools are especially useful when the user wants to write a script that 
automates a sequence of actions. For example, the user might write a script to look for an 
available resource, submit a job to it, watch the progress of the job, and present the results 
when the job has finished. 

Job submission usually consists of three parts, even if there is only one command required. 
First, some input data and possibly the executable program or execution script file are sent to 
the machine to execute the job. Sending the input is called "staging the input data." 
Alternatively, the data and program files may be pre-installed on the grid machines or 
accessible via a mountable networked file system. When the grid consists of heterogeneous 
machines, there may be multiple executable program files, each compiled for the different 
machine platforms on the grid. A nice feature provided by some grid systems is to register 
these multiple versions of the program so that the grid system can automatically choose a 
correctly matching version to the grid machine that will run the program. Some grid 
technologies require that the program and input data be first processed or "wrappered" in 
some way by the grid system. This may be done to add protective execution controls around 
the application or just to simply collect all of the data files into one. 

Second, the job is executed on the grid machine. The grid software running on the donating 
machine executes the program in a process on the user's behalf. It may use a common user 
ID on the machine or it may use the user's own user ID, depending on which grid technology 
is used. Some grid systems implement a protective "sandbox" around the program so that it 
cannot cause any disruption to the donating machine if it encounters a problem during 
execution. Rights to access files and other resources on the grid machine may be restricted. 

Third, the results of the job are sent back to the submitter. In some implementations, 
intermediate results can be viewed by the user who submitted the job. In some grid 
technologies that do not automatically stage the output data back to the user, the results must 
be explicitly sent to the user, perhaps using a networked file system. 
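
The three parts of job submission can be imitated on a single machine as a way to see how 
they fit together. This sketch is a local simulation only: a temporary directory stands in 
for the grid machine, and `run_job` is a hypothetical name rather than a command of any grid 
product. It provides no sandboxing. 

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(input_file, command):
    """Stage input, execute the job, and return its output to the caller.

    command is a list of program arguments; the staged file name is appended.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        # 1. Stage the input data into the job's working directory.
        staged = Path(work_dir) / Path(input_file).name
        shutil.copy(input_file, staged)
        # 2. Execute the job in a separate process inside that directory.
        completed = subprocess.run(
            command + [staged.name],
            cwd=work_dir,
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
        # 3. Send the results back to the submitter.
        return completed.stdout
```

For example, passing `[sys.executable, "-c", "..."]` as the command runs a short Python 
program against the staged file and returns whatever it prints. 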

Scripts are also useful for submitting a series of jobs, for a parameter space application, for 
example. Some computation problems consist of a search for the desired result based on 
some input parameters. The goal is to find the input parameters that produce the best desired 
result. For each input parameter, a separate job is executed to find the result for that value. 
The whole application consists of many such jobs, which explore the results for a large 
number of input parameter values. Scripts are usually used to launch the many subjobs, each 
receiving its own particular parameter values. Parameter inputs can sometimes be more 
complex than simply a number. Sometimes a different input data set represents the "input 
parameter." Scripts help automate the large variety of more complex parameter space study 
problems. For simpler parameter space inputs, some grid products provide a GUI to submit 
the series of subjobs, each with different input parameter values. 
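
A parameter space study can be sketched with local worker threads standing in for grid 
machines. The function `sweep` and its arguments are assumptions for illustration, and 
"best" is taken here to mean the largest result. 

```python
from concurrent.futures import ThreadPoolExecutor


def sweep(objective, parameter_values, workers=4):
    """Run one job per parameter value and return (best_value, best_result)."""
    values = list(parameter_values)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each call to objective plays the role of one subjob.
        results = list(pool.map(objective, values))
    # Collect the results and pick the parameter with the largest result.
    return max(zip(values, results), key=lambda pair: pair[1])
```

On a real grid, each call to `objective` would instead be a submitted subjob, and the final 
`max` corresponds to the collection step performed at the point of submission. 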

When there are a large number of subjobs, the work required to collect the results and 
produce the final result is usually accomplished by a single program, usually running on the 
machine at the point of job submission. If there are a very large number of subjobs required for 
an application, the work of collecting the results might be distributed as well. For example, the 
subjob that submits more subjobs to the grid would be responsible for collecting and 
aggregating the results of the subjobs it spawned. 

Data configuration 

The data accessed by the grid jobs may simply be staged in and out by the grid system. 
However, depending on its size and the number of jobs, this can potentially add up to a large 
amount of data traffic. For this reason, some thought is usually given on how to arrange to 
have the minimum of such data movement on the grid. 

For example, if there will be a very large number of sub-jobs running on most of the grid 
systems for an application that will be repeatedly run, the data they use may be copied to
each machine and reside until the next time the application runs. This is preferable to using a 
networked file system to share this data, because in such a file system, the data would be 
effectively moved from a central location every time the application is run. This is true unless
the file system implements a caching feature or replicates the data automatically.
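
The trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares the two arrangements under simplifying assumptions (every subjob reads the whole data set, and no caching or replication takes place); all figures and function names are hypothetical.

```python
def staged_traffic_gb(dataset_gb, subjobs, runs):
    """Data moved if every subjob stages the data set in for every run."""
    return dataset_gb * subjobs * runs


def precopied_traffic_gb(dataset_gb, machines):
    """Data moved if the data set is copied once to each machine and kept."""
    return dataset_gb * machines


# Hypothetical figures: 2 GB data set, 500 subjobs on 100 machines, 10 runs.
staged = staged_traffic_gb(2, 500, 10)
precopied = precopied_traffic_gb(2, 100)
print(f"staged every run: {staged} GB, copied once: {precopied} GB")
```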

There are many considerations in efficiently planning the distribution and sharing of data on a 
grid. This type of analysis is necessary for large jobs to better utilize the grid and not create
unnecessary bottlenecks. 

Monitoring progress and recovery 

The user can query the grid system to see how his application and its subjobs are 
progressing. When the number of subjobs becomes large, it becomes too difficult to list them
all in a graphical window. Instead, there may simply be one large bar graph showing some
averaged progress metric. It becomes more difficult for the user to tell if any particular subjob
is not running properly. 

A grid system, in conjunction with its job scheduler, often provides some degree of recovery 
for subjobs that fail. A job may fail due to a:

► Programming error: The job stops part way with some program fault. 

► Hardware or power failure: The machine or devices being used stop working in some way. 

► Communications interruption: A communication path to the machine has failed or is 
overloaded with other data traffic. 

► Excessive slowness: The job might be in an infinite loop or normal job progress may be 
limited by another process running at a higher priority or some other form of contention. 

It is not always possible to automatically determine if the reason for a job's failure is due to a 
problem with the design of the application or if it is due to failures of various kinds in the grid 
system infrastructure. Schedulers are often designed to categorize job failures in some way 
and automatically resubmit jobs so that they are likely to succeed, running elsewhere on the
grid. In some systems, the user is informed about any job failures and the user must decide 
whether to issue a command to attempt to rerun the failed jobs. 
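
One way a scheduler might act on such a categorization is sketched below. The four categories come from the list above; the retry policy (infrastructure problems are retried elsewhere, program faults are reported) and the limit of three attempts are assumptions chosen for illustration, not the behavior of any particular grid product.

```python
from enum import Enum


class Failure(Enum):
    PROGRAM_ERROR = "programming error"
    HARDWARE = "hardware or power failure"
    COMMUNICATION = "communications interruption"
    TOO_SLOW = "excessive slowness"


# Assumed policy: infrastructure problems are worth retrying elsewhere,
# while a program fault would most likely fail again on any machine.
RETRY_ELSEWHERE = {Failure.HARDWARE, Failure.COMMUNICATION, Failure.TOO_SLOW}


def next_action(failure, attempts, max_attempts=3):
    """Decide what to do with a subjob that has failed `attempts` times."""
    if failure in RETRY_ELSEWHERE and attempts < max_attempts:
        return "resubmit on a different machine"
    return "report to user"
```

In a system that leaves the decision to the user, the "report to user" branch would apply to every category.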

Grid applications can be designed to automate the monitoring and recovery of their own 
subjobs using functions provided by the grid system software application programming
interfaces (APIs).

Reserving resources 

To improve the quality of a service, the user may arrange to reserve a set of resources in 
advance for his exclusive or high priority use. A calendaring system analogy can be used 
here. Such a reservation system can also be used in conjunction with planned hardware or 
software maintenance events, when the affected resource might not be available for grid use.

In a scavenging grid, it may not be possible to reserve specific machines in advance. Instead,
the grid management systems may allocate a larger fraction of its capacity for a given
reservation to allow for the likelihood of some of the resources becoming unavailable. This 
must be done in conjunction with tools that have profiled the grid's workload capacity
sufficiently to have reliable statistics about the grid's ability to serve the reservation. 
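
A simple way to size such an over-allocation is sketched below, assuming profiling has produced a single availability figure: the fraction of scavenged machines that typically remain usable for the length of a reservation. The function name and the figures are hypothetical, and the calculation only matches the expected number of available machines to the requirement; it gives no guarantee.

```python
import math


def machines_to_reserve(needed, availability):
    """Machines to set aside so the expected number available is >= needed."""
    if not 0 < availability <= 1:
        raise ValueError("availability must be in (0, 1]")
    return math.ceil(needed / availability)


# Hypothetical profile: 75% of scavenged machines stay available.
reserve = machines_to_reserve(30, 0.75)
```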



Using a grid: An administrator's perspective 

This section describes the typical usage activities in using the grid from an administrator's
perspective. 

Planning 

The administrator should understand the organization's requirements for the grid to better 
choose the grid technologies that satisfy those requirements. The following sections briefly 
describe the steps the administrator may take to manage the grid. It is suggested that one 
should start by deploying a small grid first, to learn about its installation and management, 
before having to confront more complicated issues involved with a large grid.

Installation 

First, the selected grid system must be installed on an appropriately configured set of
machines. These machines should be connected using networks with sufficient bandwidth to 
other machines on the grid. Of prime importance is understanding the fail-over scenarios for 
the given grid system so that the grid can continue operating even if any of the management
machines fails in some way. Machines should be configured and connected to facilitate
recovery scenarios. Any critical databases or other data essential for keeping track of the
jobs in the grid, members of the grid, and machines on the grid should have suitable backups. 
Furthermore, public key certificates must be backed up and the private keys must be held in a 
highly secured place inaccessible by anyone else. 

After installation, the grid software may need to be configured for the local network address 
and IDs. The administrator will usually require root access to the machines managing the 
grid. In some grid systems, root access to the donor machines may also be required
to install the software on those as well. The software to be installed on the donor machines
may need to be customized so that it can find the grid management machines automatically 
and include pre-installed public keys for the grid. This software may be provided to potential 
donors on an FTP or equivalent server or be made available on physical media. 

Once the grid is operational, there may be application software and data that should be
installed on donor machines as well. This software may have specific licensing restrictions 
that should be understood and adhered to. Some grid systems include tools to assist with 
grid-wide license management. This can help both in following the rules of the licenses and
in exploiting those licenses most efficiently.

Managing enrollment of donors and users 

An ongoing task for the grid administrator is to manage the members of the grid, both the
machines donating resources and the users. Users may be further organized as project 
groups. The administrator is responsible for controlling the rights of the users in the grid. 
Donor machines may have access rights that require management as well. Grid jobs running 
on donor machines may be executed under a special grid user ID on behalf of the users 
submitting the jobs. The rights of these grid user IDs must be properly set so that grid jobs do
not allow access to parts of the donor machine to which the users are not entitled.

As users join the grid, their identity must be positively established and entered in the 
certificate authority. The user and his certificate credentials must be added to the user list
using the software appropriate for the grid system deployed. In some cases, the administrator
must propagate the user information to several or all grid machines. Also, when the grid
system depends primarily on the operating system for user login, the administrator may need
to add entries to map the grid user to specific operating system user IDs on the donor
machines. 
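
The mapping from grid users to operating system user IDs can be pictured as a lookup table held on each donor machine. The sketch below is illustrative only: the subject names, account names, and table layout are hypothetical, and a real grid system would define its own file format and tools for maintaining these entries.

```python
# Hypothetical mapping from certificate subject names to local OS accounts.
GRID_USER_MAP = {
    "/O=Example Grid/OU=Research/CN=Alice Smith": "griduser1",
    "/O=Example Grid/OU=Research/CN=Bob Jones": "griduser2",
}


def local_account(subject, user_map=GRID_USER_MAP):
    """Return the local user ID a grid job should run under.

    Unknown subjects are rejected rather than mapped to a default account,
    so an unenrolled user never gains access to a donor machine.
    """
    try:
        return user_map[subject]
    except KeyError:
        raise PermissionError(f"no local account mapped for {subject!r}") from None
```

Removing a user from the grid then amounts to deleting the corresponding entry on every machine that holds a copy of the table.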

Similar enrollment activity is usually required to enroll donor machines into the grid. The
machine's identity is established and registered with the certificate authority. The
administrator of the grid must have an agreement with the administrator of the donor machine
about user IDs, software, access rights, and any policy restrictions. The administrator must
enter the machine's identification credentials, addresses, and resource characteristics using
the appropriate software for enrolling the donor machine into the grid. In some cases, the
administrator may need to manually propagate this information to other machines in the grid. 

Corresponding procedures for removing users and machines must be executed by the
administrator. 



Certificate authority 



It is critical to ensure the highest levels of security in a grid because the grid is designed to
execute code and not just share data. Thus, it can be fertile ground for viruses, Trojan horses,
and other attacks if the grid system is compromised in any way. The certificate authority is
one of the most important aspects of maintaining strong grid security. An organization may
choose to use an external certificate authority or operate one itself. You must be able to trust
the certificate authority to strictly adhere to its responsibilities. 

The primary responsibilities of a certificate authority are: 

► Positively identifying entities requesting certificates

► Issuing, removing, and archiving certificates

► Protecting the certificate authority server

► Maintaining a namespace of unique names for certificate owners

► Serving signed certificates to those needing to authenticate entities

► Logging activity

Briefly, a certificate authority is based on the public key encryption system. In this system,
keys are generated in pairs, a public key and a private key. Either one can be used to encrypt
some data such that the other is needed to decrypt it. The private key is guarded by the
owner and never revealed to anyone. The public one is given to anyone needing it. A
certificate authority is used to hold these public keys and to guarantee who they belong to.
When a user uses his private key to encrypt something, the receiver uses the corresponding
public key to decrypt it. The receiver knows that only that user's public key can decrypt the
message correctly. However, anyone could intercept this message and decrypt it because
anyone can get the originator's public key. If the originator instead doubly encrypts the
message with his private key and the intended recipient's public key, a secure communication
link is formed. The receiver uses his private key to decrypt the message and then uses the
sender's public key for the second decryption. Now the recipient knows that if the message
decrypts properly, then only the sender could have sent it and furthermore, the sender knows
that only the intended receiver can decrypt it. The beauty of all of this is that nobody had to
securely carry an encryption key from the sender to the receiver, as must be done for 
conventional encryption systems, and any tampering with the communication is revealed. A 
similar exchange is used to get anyone's public key from the certificate authority, so that the 
user knows that he has received an unaltered public key for the desired user. 
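
The double encryption described above can be tried out with textbook RSA on deliberately tiny numbers. This is a toy sketch for following the steps only: the primes, exponents, and function names are chosen for illustration, there is no padding, and keys this small offer no security. The sender's modulus is kept smaller than the receiver's so that the intermediate value always fits.

```python
def make_keypair(p, q, e):
    """Build a textbook RSA key pair from two primes (toy sizes only)."""
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # modular inverse, needs Python 3.8+
    return (e, n), (d, n)        # (public key, private key)


def apply_key(key, value):
    """Encrypt or decrypt: both are modular exponentiation with one key."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


sender_pub, sender_priv = make_keypair(61, 53, 17)      # n = 3233
receiver_pub, receiver_priv = make_keypair(89, 97, 5)   # n = 8633

message = 1234
# Sender: own private key first (proves origin), then receiver's public key.
wire = apply_key(receiver_pub, apply_key(sender_priv, message))
# Receiver: own private key first, then the sender's public key.
recovered = apply_key(sender_pub, apply_key(receiver_priv, wire))
print(recovered == message)
```

Only the holder of the receiver's private key can undo the outer layer, and only the sender's private key could have produced an inner layer that the sender's public key opens correctly.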

Resource management 

Another responsibility of the administrator is to manage the resources of the grid. This 
includes setting permissions for grid users to use the resources as well as tracking resource 
usage and implementing a corresponding accounting or billing system. Usage statistics are
useful in identifying trends in an organization that may require the acquisition of additional
hardware, reduction in excess hardware to reduce costs, and adjustments in priorities and
policies to achieve utilization that is fairer or better achieves the overall goals of an 
organization. 

Some grid components, usually job schedulers, have provisions for enforcing priorities and 
policies of various kinds. It is the responsibility of the administrator to configure these to best 
meet the goals of the overall organization. Software license managers can be used in a grid 
setting to control the proper utilization. These may be configured to work with job schedulers 
to prioritize the use of the limited licenses. 

Data sharing 

For small grids, the sharing of data can be fairly easy, using existing networked file systems,
databases, or standard data transfer protocols. As a grid grows and the users become 
dependent on any of the data storage repositories, the administrator should consider 
procedures to maintain backup copies and replicas to improve performance. All of the 
resource management concerns apply to data on the grid. 



Using a grid: An application developer's perspective 

Grid applications can be categorized in one of the following three categories: 

► Applications that are not enabled for using multiple processors but can be executed on 
different machines. 

► Applications that are already designed to use the multiple processors of a grid setting. 

► Applications that need to be modified or rewritten to better exploit a grid. 

The latter category is of interest to grid application developers. They will find a need for tools 
for debugging and measuring the behavior of grid applications. Such grid based tools are still 
in their infancy. It may be useful for developers to configure a small grid of their own so that
they can use debuggers on each machine to control and watch the detailed workings of the 
applications. Since the debugging process can bypass certain security precautions, it may 
not always be wise to allow such debugging on a production grid. 

Globus is more a developer's toolkit for building grid components rather than a
comprehensive grid system. It has the basic components needed to build new facilities to
manage grid operations, measurement, repair, and debug grid applications. Tools conforming
to the emerging Open Grid Services Architecture (OGSA) interfaces will be usable on various 
vendor grid systems. 

The present and the future

The Globus toolkit is a set of tools useful for building a grid. Its strength is a good security 
model, with a provision for hierarchically collecting data about the grid, as well as the basic 
facilities for implementing a simple, yet world-spanning grid. Globus will grow over time 
through the work of many organizations that are extending its capabilities. More information
about Globus can be obtained at http://www.globus.org.

Most grid systems include some job schedulers, but as grids span wider areas, there will be a 
need for more meta-schedulers that can manage variously configured collections of clusters 
and smaller grids. These schedulers will evolve to better schedule jobs, considering multiple
resources rather than just CPU utilization. They will also extend their reach to implement
better quality of service, using reservations, redundancy, and history profiles of jobs and grid
performance. 

Today, grid systems are still at the early stages of providing reliable, well-performing, and
automatically recoverable virtual data sharing and storage. We will see products that take on
this task in a grid setting, federating data of all kinds, and achieving better performance
integration with scheduling, reliability, and capacity. 

Autonomic computing has the goal to make the administrator's job easier by automating the 
various complicated tasks involved in managing a grid. These include identifying problems in
real time and quickly initiating corrective actions before they seriously impair the grid.

Open Grid Services Architecture (OGSA) is an open standard at the base of all of these 
future grid enhancements. OGSA will standardize the grid interfaces that will be used by the 
new schedulers, autonomic computing agents, and any number of other services yet to be
developed for the grid. It will make it easier to assemble the best products from various
vendors, increasing the overall value of grid computing. More information about OGSA can
be obtained at http://www.globus.org/ogsa.



What the grid cannot do 



A word of caution should be given to the overly enthusiastic. The grid is not a silver bullet that 
can take any application and run it 1000 times faster without the need for buying any more
machines or software. Not every application is suitable or enabled for running on a grid.
Some kinds of applications simply cannot be parallelized. For others, it can take a large
amount of work to modify them to achieve faster throughput. The configuration of a grid can
greatly affect the performance, reliability, and security of an organization's computing
infrastructure. For all of these reasons, it is important for users to understand how far the grid
has evolved today and which features are coming tomorrow or in the distant future.
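
The limit on speedup can be estimated with Amdahl's law, a standard result that is not part of the original discussion and is offered here only as a sketch. It assumes the application splits cleanly into a serial part and a perfectly parallel part, and it ignores communication and scheduling overhead.

```python
def speedup(parallel_fraction, machines):
    """Amdahl's law: overall speedup when only part of the work is parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / machines)


# Even with 1000 machines, a job that is 90% parallel stays below 10x.
print(round(speedup(0.9, 1000), 2))
```

Under these assumptions, a 1000-fold speedup on 1000 machines requires an application with essentially no serial portion.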

Abstract

Metacomputing systems are intended to support remote
and/or concurrent use of geographically distributed
computational resources. Resource management in such
systems is complicated by five concerns that do not
typically arise in other situations: site autonomy and
heterogeneous substrates at the resources, and application
requirements for policy extensibility, co-allocation, and
online control. We describe a resource management
architecture that addresses these concerns. This
architecture distributes the resource management problem
among distinct local manager, resource broker, and
resource co-allocator components and defines an extensible
resource specification language to exchange information
about requirements. We describe how these techniques have
been implemented in the context of the Globus
metacomputing toolkit and used to implement a variety of
different resource management strategies. We report on our
experiences applying our techniques in a large testbed,
GUSTO, incorporating 15 sites, 330 computers, and 3600
processors.



1 Introduction 

Metacomputing systems allow applications to assemble
and use collections of computational resources on an
as-needed basis, without regard to physical location.
Various groups are implementing such systems and
exploring applications in distributed supercomputing,
high-throughput computing, smart instruments, collaborative
environments, and data mining [1 6, 12, 18, 20, 22, 6, 25].
This paper is concerned with resource management for
metacomputing: that is, with the problems of locating
and allocating computational resources, and with
authentication, process creation, and other activities
required to prepare a resource for use. We do not address
other issues that are traditionally associated with
scheduling (such as decomposition, assignment, and
execution ordering of tasks) or the management of other
resources such as memory, disk, and networks.

The metacomputing environment introduces five
challenging resource management problems: site autonomy,
heterogeneous substrate, policy extensibility,
co-allocation, and online control.

1. The site autonomy problem refers to the fact that
resources are typically owned and operated by different
organizations, in different administrative domains [5].
Hence, we cannot expect to see commonality in acceptable
use policy, scheduling policies, security mechanisms,
and the like.

2. The heterogeneous substrate problem derives from
the site autonomy problem and refers to the fact that
different sites may use different local resource
management systems [16], such as Condor [18], NQE [1],
CODINE [11], EASY [17], LSF [28], PBS [14], and
LoadLeveler [15]. Even when the same system is used
at two sites, different configurations and local
modifications often lead to significant differences in
functionality.

3. The policy extensibility problem arises because
metacomputing applications are drawn from a wide range
of domains, each with its own requirements. A resource
management solution must support the frequent
development of new domain-specific management
structures, without requiring changes to code
installed at participating sites.

4. The co-allocation problem arises because many
applications have resource requirements that can be
satisfied only by using resources simultaneously at
several sites. Site autonomy and the possibility of
failure during allocation introduce a need for specialized
mechanisms for allocating multiple resources, initiating
computation on those resources, and monitoring
and managing those computations.

5. The online control problem arises because substantial
negotiation can be required to adapt application
requirements to resource availability, particularly
when requirements and resource characteristics
change during execution. For example, a tele-immersive
application that needs to simulate a new
entity may prefer a lower-resolution rendering, if
the alternative is that the entity not be modeled at
all. Resource management mechanisms must support
such negotiation.
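
The tele-immersive example in item 5 amounts to a negotiation with fallback: the application states its options in order of preference, and the first one that fits the currently available resources is chosen. The sketch below assumes a single resource dimension (CPU count); the option labels and figures are hypothetical.

```python
def negotiate(preferences, available_cpus):
    """Return the first acceptable option, trying the most preferred first.

    `preferences` is a list of (label, cpus_needed) pairs, best first.
    Returns None if even the least demanding option cannot be satisfied.
    """
    for label, cpus_needed in preferences:
        if cpus_needed <= available_cpus:
            return label
    return None


# Hypothetical rendering options for a new entity in the simulation.
options = [("high resolution", 64), ("low resolution", 16)]
choice = negotiate(options, available_cpus=24)
```

With 24 CPUs available, the entity is still modeled, only at the lower resolution; a return value of None corresponds to the entity not being modeled at all.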

As we explain in Section 2, no existing resource
management system addresses all five problems. Some batch
queuing systems support co-allocation, but not site
autonomy, policy extensibility, and online control [16].
Condor supports site autonomy, but not co-allocation or
online control [18]. Gallop [26] addresses online control and
policy extensibility, but not the heterogeneous substrate
or co-allocation problem. Legion [12] does not address
the heterogeneous substrate problem.

In this paper, we describe a resource management
architecture that we have developed to address the five
problems. In this architecture, developed in the context
of the Globus project [10], we address problems of
site autonomy and heterogeneous substrate by introducing
entities called resource managers to provide a
well-defined interface to diverse, local resource management
tools, policies, and security mechanisms. To support
online control and policy extensibility, we define an
extensible resource specification language that supports
negotiation between different components of a resource
management architecture, and we introduce resource brokers
to handle the mapping of high-level application requests
into requests to individual managers. We address the
problem of co-allocation by defining various co-allocation
strategies, which we encapsulate in resource co-allocators.

One measure of success for an architecture such as this
is its usability in a practical setting. To this end, we have
implemented and deployed this architecture on GUSTO,
a large computational grid testbed comprising 15 sites,
330 computers, and 3600 processors, using LSF, NQE,
LoadLeveler, EASY, Fork, and Condor as local
schedulers. To date, this architecture and testbed have been
used by ourselves and others to implement numerous
applications and half a dozen different higher-level resource
management strategies. This experiment represents a
significant step forward in terms of number of global
metacomputing services implemented and number and variety
of commercial and experimental local resource
management systems employed. A more quantitative evaluation
of the approach remains as a significant challenge for
future work.

The rest of this paper is structured as follows. In the
next section, we review current distributed resource
management solutions. In subsequent sections we first outline
our architecture and then examine each major function in
detail: the resource specification language, local resource
managers, resource brokers, and resource co-allocators.
We summarize the paper and discuss future work in
Section 8.



2 Resource Management Approaches



Previous work on resource management for metacomputing
systems can be broken into two broad classes:

• Network batch queuing systems. These systems focus
strictly on resource management issues for a set of
networked computers. These systems do not address
policy extensibility and provide only limited support
for online control and co-allocation.

• Wide-area scheduling systems. Here, resource
management is performed as a component of mapping
application components to resources and scheduling
their execution. To date, these systems do not
address issues of heterogeneous substrates, site
autonomy, and co-allocation.

In the following, we use representative examples of these
two types of system to illustrate the strengths and
weaknesses of current approaches.

2.1 Networked Batch Queuing Systems 

Networked batch queuing systems, such as NQE [1],
CODINE [11], LSF [28], PBS [14], and LoadLeveler [15],
handle user-submitted jobs by allocating resources from a
networked pool of computers. The user characterizes
application resource requirements either explicitly, by some
type of job control language, or implicitly, by selecting
the queue to which a request is submitted. Networked
batch queuing systems typically are designed for single
administrative domains, making site autonomy difficult to
achieve. Likewise, the heterogeneous substrate problem
is also an issue because these systems generally assume
that they are the only resource management system in
operation. One exception is the CODINE system, which
introduces the concept of a transfer queue to allow jobs
submitted to CODINE to be allocated by some other
resource management system, at a reduced level of
functionality. An alternative approach to supporting substrate
heterogeneity is being explored by the PSCHED [13]
initiative. This project is attempting to define a uniform
API through which a variety of batch scheduling systems
may be controlled. The goals of PSCHED are similar in
many ways to those of the Globus Resource Allocation
Manager described in Section 5.

Batch scheduling systems provide a limited form of
policy extensibility in that resource management policy is set
by either the system or the system administrator, by the
creation of scheduling policy or batch queues. However,
this capability is not available to the end users, who have
little control over how the batch scheduling system
interprets their resource requirements.

Finally, we observe that batch queuing systems have
limited support for on-line allocation, as these systems
are designed to support applications in which the
requirements specifications are in the form "get X done soon",
where X is precisely defined but "soon" is not. In
metacomputing applications, we have more complex, fluid
constraints, in which we will want to make tradeoffs between
time (when) and space (physical characteristics). Such
constraints lead to a need for the resource management
system to provide capabilities such as negotiation, inquiry
interfaces, information-based control, and co-allocation,
none of which are provided in these systems.

In summary, batch scheduling systems do not provide
in themselves a complete solution to metacomputing
resource management problems. However, clearly some
of the mechanisms developed for resource location,
distributed process control, and remote file access, to name a
few, can be applied to wide-area systems as well.
Furthermore, we note that network batch queuing systems will
necessarily be part of the local resource management
solution. Hence, any metacomputing resource management
architecture must be able to interface to these systems.

2.2 Wide-Area Scheduling Systems 

We now examine how resource management is addressed
within systems developed specifically to schedule
metacomputing applications. To gain a good perspective on
the range of possibilities, we discuss four different
schedulers, designed variously to support specific classes of
applications (Gallop [26]), an extensible object-oriented
system (Legion [12]), general classes of parallel programs
(PRM [22]), and high-throughput computation
(Condor [18]).

The Gallop [26] system allocates and schedules tasks
defined by a static task graph onto a set of networked
computational resources. (A similar mechanism has been
used in Legion [27].) Resource allocation is implemented
by a scheduling manager, which coordinates scheduling
requests, and a local manager, which manages the
resources at a local site, potentially interfacing to
site-specific scheduling and resource allocation services. This
decomposition, which we also adopt, separates local
resource management operations from global resource
management policy and hence facilitates solutions to the
problems of site autonomy, heterogeneous substrates, and
policy extensibility. However, Gallop does not appear to
handle authentication to local resource management
services, thereby limiting the level of site autonomy that can
be achieved.

The use of a static task-graph model makes online
control in Gallop difficult. Resource selection is
performed by attempting to minimize the execution time
of the task graph as predicted by a performance model for
the application and the prospective resource. However,
because the minimization procedure and the cost
model are fixed, there is no support for policy
extensibility. Legion [12] overcomes this limitation by
leveraging its object-oriented model. Two specialized objects,
an application-specific Scheduler and a resource-specific
Enactor, negotiate with one another to make allocation
decisions. The Enactor can also provide co-allocation
functions.

Gallop supports co-allocation for resources maintained
within an administrative domain, but depends for this
purpose on the ability to reserve resources.
Unfortunately, reservation is not currently supported by most
local resource management systems. For this reason, our
architecture does not rely on reservation to perform
co-allocation, but rather uses a separate co-allocation
management service to perform this function.

The Prospero Resource Manager [22] (PRM) provides
resource management functions for parallel programs
written by using the PVM message-passing library.
PRM consists of three components: a system manager, a
job manager, and a node manager. The job manager
makes allocation decisions, while the system and node
manager actually allocate resources. The node manager
is solely responsible for implementing resource allocation
functions. Thus, PRM does not address issues of site
autonomy or substrate heterogeneity. A variety of job
managers can be constructed, allowing for policy extensibility,
although there is no provision for composing job
managers so as to extend an existing management policy. As
in our architecture, PRM has both an information
infrastructure (Prospero [21]) and a management API,
providing the infrastructure needed to perform online control.
However, unlike our architecture, PRM does not support
co-allocation of resources.



Condor [18] is a resource management system
designed to support high-throughput computations by
discovering idle resources on a network and allocating those
resources to application tasks. While Condor does not
interface with existing resource management systems,
resources controlled by Condor are deallocated as soon as
the "rightful" owner starts to use them. In this sense,
Condor supports site autonomy and heterogeneous
substrates. However, Condor currently does not interoperate
with local resource authentication, limiting the degree of
autonomy a site can assert. Condor provides an
extensible resource description language, called classified ads,
which provides limited control over resource selection to
both the application and resource. However, the
matching of application component to resource is performed
by a system classifier, which defines how matches (and
consequently resource management) take place, limiting
the extensibility of this selection policy. Finally, Condor
provides no support for co-allocation or online control.
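
The idea behind matching with classified ads is that both sides advertise attributes together with requirements on the other side, and a match requires each side to accept the other. The sketch below uses Python dictionaries and predicates as stand-ins; it does not reproduce Condor's ad language, and all attribute names and values are hypothetical.

```python
def matches(job_ad, machine_ad):
    """A job and a machine match only if each accepts the other."""
    return (job_ad["requirements"](machine_ad)
            and machine_ad["requirements"](job_ad))


job = {
    "owner": "alice",
    "memory_mb": 512,
    "requirements": lambda m: m["arch"] == "x86" and m["memory_mb"] >= 512,
}
machines = [
    # Too little memory for the job.
    {"name": "a", "arch": "x86", "memory_mb": 256,
     "requirements": lambda j: True},
    # Large enough, but its owner refuses jobs from this user.
    {"name": "b", "arch": "x86", "memory_mb": 1024,
     "requirements": lambda j: j["owner"] != "alice"},
    # Acceptable to both sides.
    {"name": "c", "arch": "x86", "memory_mb": 1024,
     "requirements": lambda j: j["memory_mb"] <= 1024},
]
eligible = [m["name"] for m in machines if matches(job, m)]
```

The matching rule itself (the body of matches) is the part that is fixed by the system, which is the limit on extensibility noted above.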

In summary, our review of current resource
management approaches revealed a range of valuable services,
but no single system that provides solutions to all five
metacomputing resource management problems posed in
the introduction.

3 Our Resource Management Architecture

Our approach to the metacomputing resource manage- 
ment problem is illustrated in Figure 1. In this architec- 
ture, an extensible resource specification language (RSL), 
discussed in Section 4 below, is used to communicate
requests for resources between components: from applications
to resource brokers, resource co-allocators, and
resource managers. At each stage in this process, information
about resource requirements, coded as an RSL
expression by the application, is refined by one or more
resource brokers and co-allocators; information about
resource availability and characteristics is obtained from an
information service.

Resource brokers are responsible for taking high-level 
RSL specifications and transforming them into more con- 
crete specifications through a process we call specializa- 
tion. As illustrated in Figure 2, multiple brokers may be 
involved in servicing a single request, with application- 
specific brokers translating application requirements into 
more concrete resource requirements, and different re- 
source brokers being used to locate available resources 
that meet those requirements. 

Transformations effected by resource brokers generate
a specification in which the locations of the required
resources are completely specified. Such a ground request
can be passed to a co-allocator, which is responsible for
coordinating the allocation and management of resources
at multiple sites. As we describe in Section 7, a variety of
co-allocators will be required in a metacomputing system,
providing different co-allocation semantics.

Resource co-allocators break a multirequest (that is,
a request involving resources at multiple sites) into its
constituent elements and pass each component to the
appropriate resource manager. As discussed in Section 5,
each resource manager in the system is responsible for
taking an RSL request and translating it into operations
in the local, site-specific resource management system.

The information service is responsible for providing
efficient and pervasive access to information about the
current availability and capability of resources. This
information is used to locate resources with particular
characteristics, to identify the resource manager associated with
a resource, to determine properties of that resource, and
for numerous other purposes as high-level resource
specifications are translated into requests to specific managers.
We use the Globus system's Metacomputing Directory
Service (MDS) [8] as our information service. MDS uses
the data representation and application programming
interface (API) defined for the Lightweight Directory Access
Protocol (LDAP) to meet requirements for uniformity,
extensibility, and distributed maintenance. It defines a 
data model suitable for distributed computing applica- 
tions, able to represent computers and networks of inter- 
est, and provides tools for populating this data model. 
LDAP defines a hierarchical, tree-structured name space 
called a directory information tree (DIT). Fields within 
the namespace are identified by a unique distinguished 
name (DN). LDAP supports both distribution and replication.
Hence, the local service associated with MDS is
exactly an LDAP server (or a gateway to another LDAP
server, if multiple sites share a server), plus the utilities
used to populate this server with up-to-date information
about the structure and state of the resources within that
site. The global MDS service is simply the ensemble of all
these servers. An advantage of using MDS as our
information service is that resource management information
can be used by other tools, as illustrated in Figure 3. 



4 Resource Specification Language



We now discuss the resource specification language itself. 
The syntax of an RSL specification, summarized in Fig- 
ure 4, is based on the syntax for filter specifications in 
the Lightweight Directory Access Protocol and MDS. An
RSL specification is constructed by combining simple
parameter specifications and conditions with the operators
&, to specify conjunction of parameter specifications; |, to
express the disjunction of parameter specifications; or +,
to combine two or more requests into a single compound
request, or multirequest.

Figure 1: The Globus resource management architecture, showing how RSL specifications pass between application,
resource brokers, resource co-allocators, and local managers (GRAMs). Notice the central role of the information
service.

The set of parameter-name terminal symbols is extensible:
resource brokers, co-allocators, and resource managers
can each define a set of parameter names that they
will recognize. For example, a resource broker that is
specialized for tele-immersive applications might accept
as input a specification containing a frames-per-second
parameter and might generate as output a specification
containing an mflops-per-second parameter, to be
passed to a broker that deals with computational resources.
Resource managers, the system components that
actually talk to local scheduling systems, recognize two
types of parameter-name terminal symbols:

• MDS attribute names, used to express constraints on 
resources: for example, memory>=64 or network=atm.
In this case, the parameter name refers to a field de- 
fined in the MDS entry for the resource being allo- 
cated. The truth of the parameter specification is 
determined by comparing the value provided with 
the specification with the current value associated
with the corresponding field in the MDS. Arbitrary 
MDS fields can be specified by providing their full 
distinguished name. 

• Scheduler parameters, used to communicate information
regarding the job, such as count (number
of nodes required), max_time (maximum time required),
executable, arguments, directory, and
environment (environment variables). Scheduler
parameters are interpreted directly by the resource
manager.

For example, the specification 

&(executable=myprog)
 (|(&(count=5)(memory>=64))
   (&(count=10)(memory>=32)))

requests 5 nodes with at least 64 MB memory, or 10 
nodes with at least 32 MB. In this request, executable 
and count are scheduler attribute names, while memory 
is an MDS attribute name. 
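
To make the grammar concrete, the following is a minimal sketch of a recursive-descent parser for the RSL syntax described in this section. It is an illustration only, not the Globus parser: the function names (tokenize, parse_rsl), the tuple-based tree representation, and the set of characters accepted in names and values are all assumptions made for this example.

```python
import re

# Tokens: two-character comparison operators first, then single-character
# symbols, then names/values (letters, digits, and a few punctuation marks).
# The accepted character set is an assumption made for this sketch.
_TOKEN = re.compile(r"\s*(>=|<=|!=|[()&|+=<>]|[A-Za-z0-9_.:/-]+)")
_COMBINERS = {"&": "and", "|": "or", "+": "multi"}
_OPS = {"=", ">", "<", ">=", "<=", "!="}


def tokenize(text):
    """Split an RSL string into a flat list of tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _parse_request(tokens, i):
    """Parse one request starting at tokens[i]; return (tree, next_index)."""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    if tokens[i] in _COMBINERS:
        kind = _COMBINERS[tokens[i]]
        i += 1
        children = []
        # A request-list is one or more parenthesized requests.
        while i < len(tokens) and tokens[i] == "(":
            child, i = _parse_request(tokens, i + 1)
            if i >= len(tokens) or tokens[i] != ")":
                raise ValueError("expected ')'")
            i += 1
            children.append(child)
        if not children:
            raise ValueError(f"'{kind}' needs at least one parenthesized request")
        return (kind, children), i
    # Otherwise this must be a simple parameter: name op value.
    if i + 3 > len(tokens) or tokens[i + 1] not in _OPS:
        raise ValueError(f"malformed parameter near token {i}")
    return ("param", tokens[i], tokens[i + 1], tokens[i + 2]), i + 3


def parse_rsl(text):
    """Parse a complete RSL specification into a nested tuple tree."""
    tokens = tokenize(text)
    tree, i = _parse_request(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing input after request")
    return tree


if __name__ == "__main__":
    spec = "&(executable=myprog)(|(&(count=5)(memory>=64))(&(count=10)(memory>=32)))"
    print(parse_rsl(spec))
```

Parsing the example specification above yields a conjunction whose first child is the executable parameter and whose second child is a disjunction of two conjunctions. A resource manager could then walk this tree and decide, per parameter name, whether to treat it as a scheduler parameter or as an MDS attribute.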

Our current RSL parser and resource manager disambiguate
these two parameter types on the basis of the
parameter name. That is, the resource manager knows 
which fields it will accept as scheduler parameters and as- 
sumes all others are MDS attribute names. Name clashes 
can be disambiguated by using the complete distinguished 
name for the MDS field in question. 

The ability to include constraints on MDS attribute 
values in RSL specifications is important. As we discuss
in Section 5, the state of resource managers is stored in
MDS. Hence, resource specifications can refer to resource
characteristics such as queue-length, expected wait time,
and number of processors available. This technique provides
a powerful mechanism for controlling how an RSL
specification is interpreted.

Figure 2: This view of the Globus resource management architecture shows how different types of broker can
participate in a single resource request.

Figure 3: The GlobusView tool uses MDS information about resource manager status to present information about
the current status of a metacomputing testbed. On the left, we see the sites that are currently participating in the
testbed; on the right is information about the total number of nodes that each site is contributing, the number of
those nodes that are currently available to external users, and the usage of those nodes by Globus users.

specification := request
request       := multirequest | conjunction | disjunction | parameter
multirequest  := + request-list
conjunction   := & request-list
disjunction   := | request-list
request-list  := ( request ) request-list | ( request )
parameter     := parameter-name op value
op            := = | > | < | >= | <= | !=
value         := ([a..Z][0..9][_])+

Figure 4: BNF grammar describing the syntax of an RSL request

The following example of a multirequest is derived from 
the example shown in Figure 2. 

+(&(count=80)(memory>=64M)
   (executable=sf_express)
   (resourcemanager=ico16.mcs.anl.gov:8711))
 (&(count=256)(network=atm)
   (executable=sf_express)
   (resourcemanager=neptune.cacr.caltech.edu:755))
 (&(count=300)(memory>=64M)
   (executable=sf_express)
   (resourcemanager=modi4.ncsa.edu:4000))

This is a ground request: every component of the multire- 
quest specifies a resource manager. A co-allocator can use 
the resourcemanager parameters specified in this request 
to determine to which resource manager each component 
of the multirequest should be submitted. 
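
The dispatch step just described can be sketched as follows. This is a simplified illustration under stated assumptions: each component of the multirequest is represented as a Python dictionary rather than as parsed RSL, the function name split_multirequest is invented for this example, and the contact strings are placeholders rather than real resource managers.

```python
from collections import defaultdict


def split_multirequest(subrequests):
    """Group the components of a ground multirequest by resource manager.

    Raises ValueError if any component lacks a resourcemanager parameter,
    since such a request is not ground and needs further brokering.
    """
    by_manager = defaultdict(list)
    for sub in subrequests:
        contact = sub.get("resourcemanager")
        if contact is None:
            raise ValueError("not a ground request: missing resourcemanager")
        by_manager[contact].append(sub)
    return dict(by_manager)


if __name__ == "__main__":
    # Placeholder contact strings; real requests would name actual managers.
    multirequest = [
        {"count": 80, "executable": "sf_express",
         "resourcemanager": "site-a.example.org:8711"},
        {"count": 256, "executable": "sf_express",
         "resourcemanager": "site-b.example.org:755"},
        {"count": 300, "executable": "sf_express",
         "resourcemanager": "site-c.example.org:4000"},
    ]
    for contact, parts in split_multirequest(multirequest).items():
        print(contact, sum(p["count"] for p in parts))
```

Rejecting components without a resourcemanager parameter keeps the co-allocator simple: locating resources remains the job of the brokers.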

Notations intended for similar purposes include the
Condor "classified ad" [18] and Chapin's "task description
vector" [5]. Our work is novel in three respects:
the tight integration with a directory service, the use of
specification rewriting to express broker operations (as
described below), and the fact that the language and
associated tools have been implemented and demonstrated
effective when layered on top of numerous different low- 
level schedulers. 

We conclude this section by noting that it is the combination
of resource brokers, information service, and RSL
that makes online control possible in our architecture. 
Together, these services make it possible to construct re- 
quests dynamically, based on current system state and 
negotiation between the application and the underlying 
resources. 

5 Local Resource Management 

We now describe the lowest level of our resource manage- 
ment architecture: the local resource managers, imple- 
mented in our architecture as Globus Resource Allocation 
Managers (GRAMs). A GRAM is responsible for

1. processing RSL specifications representing resource
requests, by either denying the request or by creating
one or more processes (a "job") that satisfy that
request;

2. enabling remote monitoring and management of jobs
created in response to a resource request; and

3. periodically updating the MDS information service
with information about the current availability and
capabilities of the resources that it manages.

A GRAM serves as the interface between a wide area
metacomputing environment and an autonomous entity
able to create processes, such as a parallel computer
scheduler or a Condor pool. Hence, a resource manager
need not correspond to a single host or a specific computer,
but rather to a service that acts on behalf of one or
more computational resources. This use of local scheduler
interfaces was first explored in the software environment
for the I-WAY networking experiment [9], but is extended
and generalized here significantly to provide a richer and
more flexible interface.

A resource specification passed to a GRAM is assumed
to be ground: that is, to be sufficiently concrete that the
GRAM can identify local resources that meet the specification
without further interaction with the entity that
generated the request. A particular GRAM implementation
may achieve this goal by scheduling resources itself
or, more commonly, by mapping the resource specification
into a request to some local resource allocation mechanisms.
(To date, we have interfaced GRAMs to six different
schedulers or resource allocators: Condor, EASY,
Fork, LoadLeveler, LSF, and NQE.) Hence, the GRAM
API plays for resource management a similar role to that
played by IP for communication: it can co-exist with local
mechanisms, just as IP rides on top of Ethernet, FDDI,
or ATM networking technology.

The GRAM API provides functions for submitting and
for canceling a job request and for asking when a job
(submitted or not) is expected to run. An implementation of
the latter function may use queue time estimation techniques
[24]. When a job is submitted, a globally unique
job handle is returned that can then be used to monitor
and control the progress of the job. In addition, a
job submission call can request that the progress of the
requested job be signaled asynchronously to a supplied
callback URL. Job handles can be passed to other processes,
and callbacks do not have to be directed to the
process that submitted the job request. These features
of the GRAM design facilitate the implementation of diverse
higher-level scheduling strategies. For example, a
high-level broker or co-allocator can make a request on
behalf of an application, while the application monitors
the progress of the request.

5.1 GRAM Scheduling Model

We discuss briefly the scheduling model defined by
GRAM because this is relevant to subsequent discussion
of co-allocation. This model is illustrated in Figure 5,
which shows the state transitions that may be experienced
by a GRAM job.

Figure 5: State transition diagram for resource allocation
requests submitted to the GRAM resource management API.

When submitted, the job is initially pending, indicating
that resources have not yet been allocated to the
job. At some point, the job is allocated the requested
resources, and the application starts running. The job
then transitions to the active state. At any point prior
to entering the done state, the job can be terminated,
causing it to enter the failed state. A job can fail because
of explicit termination, an error in the format of
the request, a failure in the underlying resource management
system, or a denial of access to the resource. The
source of the failure is provided as part of the notification
of state transition. When all of the processes in the job
have terminated and resources have been deallocated, the
job enters the done state.
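
The four states and their permitted transitions can be written down as a small state machine. The sketch below is an illustration, not the GRAM implementation: the class and attribute names (JobState, Job, on_change, failure_reason) are assumptions for this example, and the on_change hook merely stands in for the notification of state transitions.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    FAILED = "failed"


# Transitions as described in the text: a job may fail at any point before
# it reaches the done state; done and failed are terminal.
TRANSITIONS = {
    JobState.PENDING: {JobState.ACTIVE, JobState.FAILED},
    JobState.ACTIVE: {JobState.DONE, JobState.FAILED},
    JobState.DONE: set(),
    JobState.FAILED: set(),
}


class Job:
    """Tracks the state of one job and notifies an optional callback."""

    def __init__(self, on_change=None):
        self.state = JobState.PENDING
        self.failure_reason = None
        self._on_change = on_change  # hypothetical notification hook

    def transition(self, new_state, reason=None):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        if new_state is JobState.FAILED:
            # The source of the failure accompanies the notification.
            self.failure_reason = reason
        if self._on_change is not None:
            self._on_change(self, new_state, reason)


if __name__ == "__main__":
    job = Job(on_change=lambda j, s, r: print("now", s.value, r or ""))
    job.transition(JobState.ACTIVE)
    job.transition(JobState.FAILED, reason="explicit termination")
```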

5.2 GRAM Implementation 

The GRAM implementations that we have constructed
have the structure shown in Figure 6. The principal components
are the GRAM client library, the gatekeeper, the
RSL parsing library, the job manager, and the GRAM reporter.
The Globus security infrastructure (GSI) is used
for authentication and for authorization.

The GRAM client library is used by an application or
a co-allocator acting on behalf of an application. It interacts
with the GRAM gatekeeper at a remote site to perform
mutual authentication and transfer a request, which
includes a resource specification and a callback (described
below).

The gatekeeper is an extremely simple component that
responds to a request by doing three tasks: performing
mutual authentication of user and resource, determining
a local user name for the remote user, and starting a
job manager which executes as that local user and actually
handles the request. The first two security-related
tasks are performed by calls to the Globus security infrastructure
(GSI), which handles issues of site autonomy
and substrate heterogeneity in the security domain. To
start the job manager, the gatekeeper must run as a privileged
program: on Unix systems, this is achieved via suid
or inetd. However, because the interface to the GSI is
small and well defined, it is easy for organizations to approve
(and port) the gatekeeper code. In fact, the gatekeeper
code has successfully undergone security reviews
at a number of large supercomputer centers. The mapping
of remote user to locally recognized user name minimizes
the amount of code that must run as a privileged
program; it also allows us to delegate most authorization
issues to the local system.

The job manager is responsible for creating the actual
processes requested by the user. This task typically
involves submitting a resource allocation request to the 
underlying resource management system, although if no 
such system exists on a particular resource, a simple fork 
may be performed. Once processes are created, the job 
manager is also responsible for monitoring the state of 
the created processes, notifying the callback contact of
any state transitions, and implementing control opera- 
tions such as process termination. The job manager ter- 
minates once the job for which it is responsible has ter- 
minated. 

The GRAM reporter is responsible for storing into 
MDS various information about scheduler structure (e.g., 
whether the scheduler supports reservation and the num- 
ber of queues) and state (e.g., total number of nodes, 
number of nodes currently available, currently active jobs, 
and expected wait time in a queue). An advantage of
implementing the GRAM reporter as a distinct component
is that MDS reports can continue even when no gate- 
keeper or job manager is running: for example, when the 
gatekeeper is run from inetd. 

As noted above, GRAM implementations have been 
constructed for six local schedulers to date: Condor, LSF, 
NQE, Fork, EASY, and LoadLeveler. Much of the GRAM
code is independent of the local scheduler, and so only a 
relatively small amount of scheduler-specific code needed 
to be written in each case. In most cases, this code com- 
prises shell scripts that use the local scheduler's user-level 
API. State transitions are handled mostly by polling,
because this proved to be more reliable than monitoring
job processes by using mechanisms provided by the local 
schedulers. 

6 Resource Brokers 

Figure 6: Major components of the GRAM implementation. Those represented by thick-lined ovals are long-lived
processes, while the thin-lined ovals are short-lived processes created in response to a request.

As noted above, we use the term resource broker to denote
an entity in our architecture that translates abstract resource
specifications into more concrete specifications. As
illustrated in Figure 2, this definition is broad enough to
encompass a variety of behaviors, including application-level
schedulers [3] that encapsulate information about
the types of resource required to meet a particular performance
requirement, resource locators that maintain information
about the availability of various types of resource,
and (ultimately) traders that create markets for
resources. In each case, the broker uses information maintained
locally, obtained from MDS, or contained in the
specification to specialize the specification, mapping it
into a new specification that contains more detail. Requests
can be passed to several brokers, effectively composing
the behaviors of those brokers, until eventually the
specification is specialized to the point that it identifies
a specific resource manager. This specification can then
be passed to the appropriate GRAM or, in the case of a
multirequest, to a resource co-allocator.
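
Specialization by a chain of brokers can be sketched as function composition over a specification. Everything concrete in this sketch is an assumption made for illustration: specifications are plain dictionaries rather than RSL, the conversion factor MFLOPS_PER_FRAME is invented, and the directory argument is a stand-in for information a broker would obtain from MDS.

```python
from functools import partial

# Invented conversion factor, for illustration only.
MFLOPS_PER_FRAME = 10


def teleimmersion_broker(spec):
    """Application-specific broker: rewrite a frame-rate requirement
    into a computational requirement."""
    spec = dict(spec)
    if "frames-per-second" in spec:
        fps = spec.pop("frames-per-second")
        spec["mflops-per-second"] = fps * MFLOPS_PER_FRAME
    return spec


def compute_locator(spec, directory):
    """Resource locator: pick the first manager whose advertised capacity
    meets the requirement. `directory` stands in for an MDS query."""
    spec = dict(spec)
    if "resourcemanager" not in spec:
        need = spec.get("mflops-per-second", 0)
        for contact, capacity in directory.items():
            if capacity >= need:
                spec["resourcemanager"] = contact
                break
    return spec


def specialize(spec, brokers):
    """Pass a specification through each broker in turn."""
    for broker in brokers:
        spec = broker(spec)
    return spec


if __name__ == "__main__":
    directory = {"small.example.org:1": 100, "big.example.org:2": 500}
    brokers = [teleimmersion_broker, partial(compute_locator, directory=directory)]
    print(specialize({"executable": "viewer", "frames-per-second": 30}, brokers))
```

Each broker returns a new specification instead of mutating its input, which mirrors the idea of rewriting a specification into a more detailed one. The result is ground once a resourcemanager entry is present.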



We claim that our architecture makes it straightfor- 
ward to develop a variety of higher-level schedulers. In 
support of this claim, we note that following the defini- 
tion and implementation of GRAM services, a variety of 
people, including people not directly involved in GRAM 
definition, were able to construct half a dozen resource 
brokers quite quickly. We describe three of these here. 



6.1 Nimrod-G 

David Abramson and Jonathan Giddy are using GRAM 
mechanisms to develop Nimrod-G, a wide-area version 
of the Nimrod [2] tool. Nimrod automates the creation 
and management of large parametric experiments. It 
allows a user to run a single application under a wide 
range of input conditions and then to aggregate the results
of these different runs for interpretation. In effect,
Nimrod transforms file-based programs into interactive 
"meta-applications" that invoke user programs much as 
we might call subroutines. 

When a user first requests that a computational experiment
be performed, Nimrod-G queries MDS to locate
suitable resources. It uses information in MDS entries 
to identify sufficient nodes to perform the experiment. 
The initial Nimrod-G prototype operates by generating 
a number of independent jobs, which are then allocated 
to computational nodes using GRAM. This module hides 
the nature of the execution mechanism on the underlying
platform from Nimrod, hence making it possible to
schedule work using a variety of different queue managers 
without modification to the Nimrod scripts. As a result, 
a reasonably complex cluster computing system could be 
retargeted for wide-area execution with relatively little 
effort. 

In the future, the Nimrod-G developers plan to provide
a higher-level broker that allows the user to specify
time and cost constraints. These constraints will be
used to select computational nodes that can meet user
requirements for time and cost or, if constraints cannot
be met, to explain the nature of the cost/time tradeoffs.
As part of this work, a dynamic resource allocation module
is planned that will monitor the state of each system
and relocate work when necessary in order to meet the
deadlines.

6.2 AppLeS 

Rich Wolski has used GRAM mechanisms to construct
an application-level scheduler (AppLeS) [3] for a large,
loosely coupled problem from computational mathemat- 
ics. As in Nimrod-G, the goal was to map a large num- 
ber of independent tasks to a dynamically varying pool of 
available computers. GRAM mechanisms were used to lo- 
cate resources (including parallel computers) and to initi- 
ate and manage computation on those resources. AppLeS 
itself provided fault tolerance, so that errors reported by 
GRAM would result in a task being resubmitted else- 
where. 

6.3 A Graphical Resource Selector 

The graphical resource selector (GRS) illustrated in Fig- 
ure 7 is an example of an interactive resource selector 
constructed with our services. This Java application allows
the user to build up a network representing the resources
required for an application; another network can
be constructed to monitor the status of candidate phys- 
ical resources. A combination of automatic and manual 
techniques is then used to guide resource selection, even- 
tually generating an RSL specification for the resources 
in question. MDS services are used to obtain the infor- 
mation used for resource monitoring and selection, and 
resource co-allocator services are used to generate the 
GRAM requests required to execute a program once a 
resource selection is made. 

7 Resource Co-allocation 

Through the actions of one or more resource brokers, the 
requirements of an application are refined into a ground 
RSL expression. If the expression consists of a single resource
request, it can be submitted directly to the manager
that controls that resource. However, as discussed
above, a metacomputing application often requires that
several resources, such as two or more computers and
intervening networks, be allocated simultaneously. In
these cases, a resource broker produces a multirequest,
and co-allocation is required. The challenge in responding
to a co-allocation request is to allocate the requested resources
in a distributed environment, across two or more
resource managers, where global state, such as availability
of a set of resources, is difficult to determine.

Within our resource management architecture, multirequests
are handled by an entity called a resource
co-allocator. In brief, the role of a co-allocator is to split a
request into its constituent components, submit each
component to the appropriate resource manager, and then
provide a means for manipulating the resulting set of re- 
sources as a whole: for example, for monitoring job sta- 
tus or terminating the job. Within these general guide- 
lines, a range of different co-allocation services can be 
constructed. For example, we can imagine allocators that 

• mirror current GRAM semantics: that is, require all 
resources to be available before the job is allowed 
to proceed, and fail globally if failure occurs at any 
resource; 

• allocate at least N out of M requested resources and 
then return; or 

• return immediately, but gradually return more resources
as they become available.

Each of these services is useful to a class of applications. 
To date, we have had the most experience with a co- 
allocator that takes the first of these approaches: that 
is, extends GRAM semantics to provide for simultaneous 
allocation of a collection of resources;^ enabling the dis- 
tributed collection of processes to be treated as a unit. 
We discuss this co-allocator in more detail. 

Fundamental to a GRAM-style concurrent allocation
algorithm is the ability to determine whether the desired
set of resources is available at some time in the future. If
the underlying local schedulers support reservation, this
question can be easily answered by obtaining a list of
available time slots from each participating resource manager,
and choosing a suitable timeslot [23]. Ideally, this
scheme would use transaction-based reservations across a
set of resource managers, as provided by Gallop [26]. In
the absence of transactions, the ability either to make a
tentative reservation or to retract an existing reservation
is needed. However, in general, a reservation-based strategy
is limited because currently deployed local resource
management solutions do not support reservation.
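
Where reservation is available, choosing a suitable timeslot amounts to intersecting the lists of free slots reported by the participating managers. The sketch below shows one way to do this; it is not taken from the systems cited above. It assumes that each manager reports non-overlapping (start, end) intervals in a common time unit, and the function names are invented for this example.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def earliest_common_slot(slots_per_manager, duration):
    """Return the earliest start time at which every manager is free for
    `duration` time units, or None if no such time exists."""
    common = None
    for slots in slots_per_manager:
        slots = sorted(slots)
        common = slots if common is None else intersect(common, slots)
    for start, end in common or []:
        if end - start >= duration:
            return start
    return None


if __name__ == "__main__":
    free = [
        [(0, 10), (20, 40)],   # manager A
        [(5, 25), (30, 50)],   # manager B
        [(8, 35)],             # manager C
    ]
    print(earliest_common_slot(free, duration=5))
```

A real co-allocator would still have to reserve the chosen slot at every manager and retract the reservations already made if one of them is refused, which is why transactions or tentative reservations are needed.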

Figure 7: A screen shot of the Graphical Resource Selector. This network shows three candidate resources and
associated network connections. Static information regarding operating system version and dynamically updated
information regarding the number of currently available nodes (freenodes) and network latency and bandwidth (in
msec and Mb/s, respectively), allows the user to select appropriate resources for a particular experiment.

In the absence of reservation, we are forced to use indirect
methods to achieve concurrent allocation. These
methods optimistically allocate resources in the hope that
the desired set will be available at some "reasonable" time
in the future. Guided by sources of information, such as
the current availability of resources (provided by MDS) or
queue-time estimation [24, 7], a resource broker can construct
an RSL request that is likely, but not guaranteed,
to succeed. If for some reason the allocation eventually
fails, all of the started jobs must be terminated. This
approach has several drawbacks:

• It is inefficient in that computational resources are
wasted while waiting for all of the requested resources
to become available.

• We need to ensure that application components do 
not start to execute before the co-allocator can de- 
termine whether the request will succeed. Therefore, 
the application must perform a barrier operation 
to synchronize startup across components, meaning 
that the application must be altered beyond what is 
required for GRAM. 

• Detecting failure of a request can be difficult if
some of the request components are directed to resource
managers that interface to queue-based local
resource management systems. In these situations, a 
timeout must be used to detect failure. 

However, in spite of all of these drawbacks, co-allocation 
can frequently be achieved in practice as long as the re- 
source requirements are not large compared with the ca- 
pacity of the metacomputing system. 

We have implemented a GRAM-compatible co-allocator
that implements a job abstraction in which multiple
GRAM subjobs are collected into a single distributed
job entity. State information for the distributed job is
synthesized from the individual states of each subjob,
and job control (e.g., cancellation) is automatically propagated
to the resource managers at each subjob site. Subjobs
are started independently and, as discussed above,
must perform a runtime check-in operation. With the
exception of this check-in operation, the co-allocator interface
is a drop-in replacement for GRAM.
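
One plausible way to synthesize the state of a distributed job from its subjobs, under the "all or nothing" semantics described here, is sketched below. The exact rules used by the implemented co-allocator are not given in the text, so the precedence chosen here (any failure fails the whole job; the job is active only once no subjob is still pending) is an assumption, as are the class and function names.

```python
def synthesize_state(subjob_states):
    """Combine subjob states ("pending", "active", "done", "failed")
    into one state for the distributed job."""
    states = list(subjob_states)
    if not states:
        raise ValueError("a distributed job needs at least one subjob")
    if any(s == "failed" for s in states):
        return "failed"
    if all(s == "done" for s in states):
        return "done"
    if all(s in ("active", "done") for s in states):
        return "active"
    return "pending"


class DistributedJob:
    """Holds subjob states and propagates cancellation to every site."""

    def __init__(self, cancel_functions):
        # cancel_functions maps a subjob name to a callable that cancels it;
        # in a real system this would be a call to the site's manager.
        self._cancel = dict(cancel_functions)
        self.states = {name: "pending" for name in self._cancel}

    def update(self, name, state):
        self.states[name] = state
        if self.state == "failed":
            self.cancel()  # all-or-nothing: one failure cancels the rest

    @property
    def state(self):
        return synthesize_state(self.states.values())

    def cancel(self):
        for name, cancel in self._cancel.items():
            if self.states[name] in ("pending", "active"):
                cancel()
                self.states[name] = "failed"
```

The update method makes the cost of strict semantics visible: a single failed subjob triggers cancellation of every other subjob.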

We have used this co-allocator to manage resources 
for SF-Express [19, 4], a large-scale distributed interac- 
tive simulation application. Using our co-allocator and 
the GUSTO testbed, we were able to simultaneously obtain
852 compute nodes on three different architectures
located at six different computer centers, controlled by
three different local resource managers. The use of a
co-allocation service significantly simplified the process of
resource allocation and application startup.

Running SF-Express "at scale" on a realistic testbed 
allowed us to study the scalability of our co-allocation 
strategy. One clear lesson learned is that the strict "all
or nothing" semantics of the distributed job abstraction
severely limits scalability. Even if each individual paral- 
lel computer is reasonably reliable and well understood, 
the probability of subjob failure due to improper con- 
figuration, network error, authorization difficulties, and 
the like, increases rapidly as the number of subjobs increases.
Yet many such failure modes resulted simply
from a failure to allocate a specific instance of a commodity
resource, for which an equivalent resource could
easily have been substituted. Because such failures frequently
occur after a large number of subjobs have been
successfully allocated, it would be desirable to make the
substitution dynamically, rather than to cancel all the
allocations and start over.
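
A back-of-the-envelope calculation shows why strict semantics scale poorly. Assume, purely for illustration, that each subjob succeeds independently with the same probability p; the text does not state such a model, and real failures are unlikely to be fully independent. An all-or-nothing request with n subjobs then succeeds with probability p to the power n.

```python
def all_or_nothing_success(p, n):
    """Probability that all n subjobs succeed, assuming each succeeds
    independently with probability p (an assumption for illustration)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p ** n


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50):
        print(n, round(all_or_nothing_success(0.95, n), 3))
```

Even with a per-subjob success rate of 95 percent, the chance that ten subjobs all succeed is only about 60 percent under this model, which is consistent with the observation that substituting an equivalent resource is preferable to restarting the whole allocation.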

We plan to extend the current co-allocation structure
to support such dynamic job structure modification. By
passing information about the nature of the subjob failure
out of the co-allocator, a resource broker can edit the
specification, effectively implementing a backtracking algorithm
for distributed resource allocation. Note that we
can encode the necessary information about failure in a
modified version of the original RSL request, which can
be returned to the component that originally requested
the co-allocation services. In this way, we can iterate
through the resource-broker/co-allocation components of
the resource management architecture until an acceptable
collection of resources has been acquired on behalf of the
application.

8 Conclusions 

We have described a resource management architecture 
for metacomputing systems that addresses requirements 
of site autonomy, heterogeneous substrates, policy exten- 
sibility, co-allocation, and online control. This architec- 
ture has been deployed and applied successfully in a large 
testbed comprising 15 sites, 330 computers, and 3600 processors,
within which LSF, NQE, LoadLeveler, EASY,
Fork, and Condor were used as local schedulers. 

The primary focus of our future work in this area will be 
on the development of more sophisticated resource broker
and resource co-allocator services within our architecture, 
and on the extension of our resource management archi- 
tecture to encompass other resources such as disk and 
network. We are also interested in the question of how
policy information can be encoded so as to facilitate au- 
tomatic negotiation of policy requirements by resources, 
users, and processes such as brokers acting as intermedi- 
aries. 
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A range of grid-related questions frequently asked by IBMers and 
customers alike. 

* What is a grid?
* What effect does grid have on users whose machines are being
  utilized for processing?
* Is grid computing available today, or is it more of a future
  statement?
* What industries are using grid computing now?
* What are the possible benefits of a grid deployment?
* What is IBM's relationship with grid computing?
* Grid and e-business on demand: what's the connection?
* Does IBM use grid computing in its own infrastructure?
* If I want to learn more about IBM Grid Computing, what's the
  first step?
* What does it take to build a grid?
* What about security in grid environments?

What is a grid? 

All or some of a group of computers, servers and storage across an 
enterprise, virtualized as one large computing system. Because grids 
unleash latent power that, at any one time, is not being used, they 
can give companies a huge gain in power, speed and collaboration, 
radically accelerating compute-intensive processes. Cost, 
meanwhile, can remain low, as grids can be built using existing 
infrastructure, helping to ensure optimal utilization of computing 
capabilities. 


What effect does grid have on users whose machines are being 
utilized for processing? 

Grids are designed to be seamless and transparent. A user whose
desktop PC, say, is contributing processing power to the grid will 
experience no negative effects: the grid runs in the background, 
utilizing available resources when needed by the system. If the PC 
user decides to run an application that requires more processing 
power, the work currently being processed on that machine will be 
dynamically reallocated to another machine in the grid with 
available processing power. 



Is grid computing available today, or is it more of a future
statement?

Grid computing is used today by many companies across a number 
of industries. Current IBM customer references for grid include 
Butterfly.net, a development studio, online publisher and 
infrastructure provider for massively multiplayer games that connect 
players on PCs, consoles and mobile devices. Butterfly Grid
consists of two clusters of approximately 50 IBM eServer
xSeries servers running in IBM hosting facilities. Specialized
game servers and database servers are fully meshed over high-speed 
fiber-optic lines, enabling transparent routing of players to different 
servers in the grid. Another current reference for IBM Grid 
Computing is the University of Pennsylvania's groundbreaking 
National Digital Mammography Archive, which gives rapid retrieval 
of digital patient files from multiple locations in a secure 
environment. The University of Pennsylvania Grid manages this 
huge data volume, schedules traffic and encrypts all image and 
information transmission using portal systems running almost
exclusively on IBM hardware — including sixteen distributed IBM 
Netfinity servers running Linux and Windows 2000. 

What industries are using grid computing now? 

Some examples include: Automotive and aerospace, for 
collaborative design and data-intensive testing; financial services, 
for running long, complex scenarios and arriving at more accurate 
decisions; life sciences, for analyzing and decoding strings of 
biological and chemical information; government, for enabling 
seamless collaboration and agility in both civil and military
departments and agencies; higher education, for enabling advanced
data- and compute-intensive research.

What are the possible benefits of a grid deployment? 

Benefits can be extensive. They include: 

• Accelerated time to results, which allows for the provisioning of 
extra time and resources to solve problems that were previously 
unsolvable 

• Improved productivity and collaboration 

• Allowing widely dispersed departments and businesses to create 
virtual organizations to share data and resources 

• More flexible, resilient operational infrastructures

• Instantaneous access to compute and data resources to "sense and 
respond" to needs 

• Leveraging existing capital investments, which helps to ensure 
optimal utilization of computing capabilities

• Avoiding common pitfalls of over-provisioning and incurring 
excess costs 

• Freeing IT organizations from the burden of administering 
disparate, non-integrated systems

What is IBM's relationship with grid computing?

IBM views grid computing as critical to the ongoing development of 
on demand operating environments. For all the excitement and 
innovation that grid represents, much of the thinking and technology 
that drive grid are anything but new to IBM. IBM was an early
leader in "virtualization" — the driving force behind grid computing 
— which has enabled the computer to do many processing jobs 
simultaneously for thousands of users. Grid computing is an 
advanced evolution of virtualization — and IBM Grid Computing 
continues IBM's history of IT innovation for business. Deep 
experience in e-business processes, support for open standards, 
enabling of our products and services for grid, partnership role in the 
grid community and relationships with Business Partners make IBM 
an important force in bringing the benefits of grid to enterprise 
computing. 


Grid and e-business on demand: what's the connection?

Grid computing is a key element in e-business on demand. Because 
it enables new kinds of power, flexibility and integration, IBM Grid 
Computing is a key element of the on demand operating 
environment. 


Does IBM use grid computing in its own infrastructure? 

Yes. IBM is a major user of grid computing. IBM's intraGrid, based 
on the Globus Toolkit, is a research and development grid that 
allows IBM to leverage many worldwide assets for research 
purposes and help us understand the complexities of managing a grid 
infrastructure on an enterprise scale. And IBM uses grids for other 
purposes throughout the company. One example is the IBM 
Boeblingen Lab Grid, composed of three IBM eServer pSeries
clusters running AIX and LoadLeveler, a cross-departmental grid
used to run zSeries processor unit simulations. Jobs are submitted
through a web portal, presenting users with the same interface as the 
one they used when running simulations on an isolated cluster. The 
WebSphere based portal uses the Globus Java CoG Kit to pre-select 
candidate queues for submitting each simulation, using Globus 
Metacomputing Directory Service. This pre-selection is based on 
cluster loads and job characteristics. Access to a shared DB2 
database allows for the automated generation of proxy certificates 
and for the monitoring and reporting of user jobs. 

If I want to learn more about IBM Grid Computing, what's the
first step?

Sign up for a Grid Innovation Workshop. These sessions offer a 
hands-on, business-specific understanding of grid computing's 
strategic, financial and operational advantages for your business. 
Customized to individual organizations, IBM Grid Innovation 
Workshops help companies examine how grid technology can help 
solve their specific information problems. The Workshop includes 
an Executive Session, work sessions, validation of findings and a 
preliminary plan. 

To sign up, contact us today 

What does it take to build a grid? 

Building a grid can be as simple as enabling a small number of PCs 
(or server or storage network) to take advantage of underutilized 
processing and storage. This can radically speed completion of a 
single set of data- or compute-intensive tasks. From a relatively 
small deployment, you could expand slowly or quickly, narrowly or 
widely, depending on business needs. Ultimately, an entire 
enterprise can be enabled for grid, and grids can bring together not
only departments and processes within a single company but also 
those among separate enterprises. 

What about security in grid environments? 

Grid Security Infrastructure (GSI) is a public-key-based security 
protocol, using X.509 certificates, a widely employed standard. The 
protocol provides single sign-on authentication, which allows a user 
to create a proxy credential that can authenticate with any remote 
service on the user's behalf, as well as communication protection and
initial support for restricted delegation. 
NEOS AND CONDOR: SOLVING OPTIMIZATION PROBLEMS
OVER THE INTERNET*

Michael C. Ferris†, Michael P. Mesnier‡, Jorge J. Moré§

Abstract 

We discuss the use of Condor, a distributed resource management system, as a 
provider of computational resources for NEOS, an environment for solving optimization 
problems over the Internet. We also describe how problems are submitted and processed 
by NEOS, and then scheduled and solved by Condor on available (idle) workstations.

1 Introduction

The NEOS Server [8] is a novel environment for solving optimization problems over the 
Internet. There is no need to download an optimization solver, write code to call the
optimization solver, or compute derivatives for nonlinear problems. NEOS provides the
user with the input format and a list of solvers for the optimization problem. Given an 
optimization problem, NEOS solvers compute derivatives and sparsity patterns of nonlin- 
ear problems with automatic differentiation tools, link with the appropriate libraries, and 
execute the resulting binary. The user is provided with a solution and runtime statistics.

Each solver in the NEOS optimization library is maintained by a software administrator 
that is responsible for providing computing resources and for answering questions related 
to the solver. Registering the solver [8] on a few workstations provides adequate resources 
in most cases; for large problems, however, we need a different approach. The obvious
difficulty is that the owner of a workstation is reluctant to provide large amounts of
computing cycles and memory. We use Condor [15, 11], a distributed resource management
system, as a provider of computational resources for a NEOS solver. The resources that
are managed by Condor are typically large clusters of workstations, many of which would
otherwise be idle for long periods of time. 

We discuss the connection between NEOS and Condor in the context of a single but 
important optimization problem: mixed complementarity problems. Our discussion shows 
that this approach can be extended to other problems as well. 

*This work was supported by the Mathematical, Information, and Computational Sciences Division
subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under
Contract W-31-109-Eng-38, by the National Science Foundation under Grants CDA-9726385 and
CCR-9619765, and by the National Science Foundation, through the Center for Research on Parallel Computation,
under Cooperative Agreement No. CCR-9120008.

†Computer Sciences Department, University of Wisconsin - Madison, 1210 West Dayton St., Madison,
Wisconsin 53706. ferris@cs.wisc.edu

‡Department of Computer Science, University of Illinois, 1304 W. Springfield, Urbana, Illinois, 61801.
§Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue,
Many different applications can be formulated as mixed complementarity problems;
examples are given in [9, 13]. If the nonlinear function $F : \mathbb{R}^n \to \mathbb{R}^n$ describes the interactions
of a nonlinear process as a function of the variables $x \in \mathbb{R}^n$, then the mixed complementarity
problem is to find a vector $x$, with components between lower and upper bounds $l$ and
$u$ (with $l < u$), such that

$$
\begin{aligned}
F_i(x) &= 0 \quad \text{if } l_i < x_i < u_i, \\
F_i(x) &\ge 0 \quad \text{if } x_i = l_i, \\
F_i(x) &\le 0 \quad \text{if } x_i = u_i.
\end{aligned}
\tag{1.1}
$$
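
The three cases in (1.1) translate directly into a check that a candidate point satisfies the mixed complementarity conditions. The Python sketch below is only an illustration of the definition; the function name `mcp_violation`, the tolerance, and the two-variable example are assumptions, and none of this is part of NEOS or PATH.

```python
from typing import Callable, List, Sequence


def mcp_violation(F: Callable[[Sequence[float]], List[float]],
                  x: Sequence[float],
                  l: Sequence[float],
                  u: Sequence[float],
                  tol: float = 1e-8) -> float:
    """Return the largest violation of the mixed complementarity conditions at x."""
    worst = 0.0
    for fi, xi, li, ui in zip(F(x), x, l, u):
        if xi < li - tol or xi > ui + tol:
            viol = max(li - xi, xi - ui)      # x_i outside its bounds
        elif abs(xi - li) <= tol:
            viol = max(0.0, -fi)              # at lower bound: need F_i(x) >= 0
        elif abs(xi - ui) <= tol:
            viol = max(0.0, fi)               # at upper bound: need F_i(x) <= 0
        else:
            viol = abs(fi)                    # strictly inside: need F_i(x) = 0
        worst = max(worst, viol)
    return worst


# Hypothetical two-variable problem: F(x) = (x_1 - 1, x_2 + 2) on [0, 10]^2.
def F_example(x: Sequence[float]) -> List[float]:
    return [x[0] - 1.0, x[1] + 2.0]


lower, upper = [0.0, 0.0], [10.0, 10.0]
at_solution = mcp_violation(F_example, [1.0, 0.0], lower, upper)
off_solution = mcp_violation(F_example, [2.0, 0.0], lower, upper)
```

At the point (1, 0) the first component is interior with a zero function value and the second sits at its lower bound with a positive function value, so no condition is violated.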

Solving a mixed complementarity problem in the typical computational environment re- 
quires that the user first develop code for the evaluation of F. The user must then decide 
on an appropriate solver, retrieve the solver, develop code to evaluate the Jacobian matrix 
F'(x) and sparsity pattern of F'(x), link their code with the necessary libraries, and finally
execute the solver locally. With NEOS, the user need only specify the mixed complementarity
problem by providing code to evaluate the function F, the lower and upper bounds
l and u, and a starting point. The NEOS solver then generates code to compute the Jacobian
matrix and sparsity pattern, compiles the user subroutines, links with the appropriate
libraries, executes the solver on a NEOS machine, and returns a solution to the user. 

NEOS uses the Condor pool at the University of Wisconsin for solving complementarity
problems. The pairing of NEOS with Condor is an ideal combination. NEOS provides
an interface that is problem oriented and independent of the computing resources. Users
need only provide a specification of the problem; all other information needed to solve the
problem is determined by the NEOS solver. Condor provides the computational resources
to solve the problem. 

Condor acts as a matchmaker, pairing computational resources with jobs that require 
processing. The job executes on the allocated workstation until completion or until the 
workstation becomes unavailable. In the latter case, the job is frozen in its current state and 
the workstation is returned to the owner. Condor is then contacted once again for pairing, 
and the job is restarted from its frozen state on the newly allocated resource. Condor 
pays special attention to the needs of the workstation owner by allowing the owner to
define the conditions under which the workstation can be allocated. This policy encourages
workstation owners to place their resources in the Condor pool, and as a consequence, the 
Wisconsin pool currently has over 400 workstations. 
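
The matchmaking and freeze/restart behavior described above can be modeled in a few lines. This Python sketch is a toy model under stated assumptions: `Workstation`, `Job`, `match`, and `run_until_evicted` are hypothetical names, the owner policy is reduced to a single predicate, and checkpointing is reduced to a step counter. It does not use or imitate the actual Condor interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Workstation:
    name: str
    idle_minutes: int
    # Owner-defined policy: may this machine be allocated right now?
    owner_allows: Callable[["Workstation"], bool]


@dataclass
class Job:
    name: str
    steps_total: int
    steps_done: int = 0  # checkpointed progress survives eviction


def match(pool: List[Workstation]) -> Optional[Workstation]:
    """Pair a job with the first workstation whose owner policy permits it."""
    for ws in pool:
        if ws.owner_allows(ws):
            return ws
    return None


def run_until_evicted(job: Job, steps_available: int) -> bool:
    """Advance the job; return True if it finished, False if it was frozen."""
    job.steps_done = min(job.steps_total, job.steps_done + steps_available)
    return job.steps_done == job.steps_total


# Hypothetical policy: a machine is available after 15 idle minutes.
policy = lambda w: w.idle_minutes >= 15
pool = [Workstation("a", 2, policy), Workstation("b", 30, policy)]
chosen = match(pool)

job = Job("mcp", steps_total=10)
finished_first = run_until_evicted(job, 4)   # owner returns: job is frozen
finished_second = run_until_evicted(job, 8)  # restarted from saved state
```

The second call resumes from the saved step count rather than from zero, which is the property that makes eviction cheap for the job.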

In Section 2 we describe the three interfaces that are currently available to submit 
problems to NEOS: e-mail, the NEOS Submission Tool (neos-submit), and the NEOS 
Web interface. These interfaces are designed so that problem submission is intuitive and 
requires only essential information. Parameters that affect the progress of the solver are
not required but can be specified, for example, by an auxiliary file. We concentrate on the 
NEOS Submission Tool. The NEOS Web interface can be sampled by visiting the URL 
http://www.mcs.anl.gov/otc/Server/

for the NEOS Server. We emphasize mixed complementarity problems, but NEOS handles
a wide variety of linear and nonlinearly constrained optimization problems; solvers for 
optimization problems subject to integer variables are being added. We do not discuss 
the design and implementation of the Server because these issues are covered by Czyzyk,
Mesnier, and Moré [8]. Extensions to the NEOS Server and the network computing issues
that arise from the emerging style of computing used by NEOS are discussed by Gropp and
Moré [14].

Mixed complementarity problems submitted to the NEOS Server are currently solved 
by the PATH [10, 12] solver, which implements a Newton-type method for solving systems 
of non-differentiable equations. Sparse matrix techniques are used for large problems. The 
process used to solve a nonlinear complementarity problem by this NEOS solver includes 
the generation of derivative and sparsity information with the ADIFOR [4, 5] automatic 
differentiation tool and the solution of the problem with Condor. The process is governed by 
a solver script that must check the user data and provide appropriate messages in the case 
of errors. In Section 3 we describe the various issues that must be addressed by the solver 
script. These issues are important to the development of reliable optimization software and 
problem-solving environments. 

The automatic differentiation techniques used to generate derivatives and sparsity patterns
for nonlinear complementarity problems are described in Section 4. In particular, we
explain how to obtain a sparse representation of the Jacobian matrix that is suitable for
PATH. The Jacobian matrix generated by ADIFOR is accurate to full machine precision,
while the Jacobian matrix generated by differences of function values suffers from trunca- 
tion errors. Moreover, the code produced by ADIFOR for the computation of the sparse 
Jacobian matrix is typically more efficient than the code produced by differences. On the 
other hand, the code produced by ADIFOR may not be as efficient as a hand-coded Jacobian 
matrix. See [1] for a full comparison (in terms of memory and speed) of ADIFOR-generated
Jacobian matrices with both hand-coded and difference approximations, and [2] and [7] for 
performance issues related to the automatic computation of gradients. 
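
The accuracy gap between automatic differentiation and differencing can be seen on a one-variable example. The Python sketch below uses operator overloading on dual numbers, which is a different mechanism from ADIFOR's Fortran source transformation, so it illustrates the principle only; the class `Dual`, the test function, and the step size are assumptions.

```python
class Dual:
    """A value paired with its derivative (forward-mode automatic differentiation)."""

    def __init__(self, value: float, deriv: float = 0.0):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other: "Dual") -> "Dual":
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def cube(x):
    """Works on plain floats and on Dual values alike."""
    return x * x * x


def derivative_ad(x0: float) -> float:
    # Seed the derivative of the independent variable with 1.
    return cube(Dual(x0, 1.0)).deriv


def derivative_fd(x0: float, h: float = 1e-4) -> float:
    # Forward difference: the truncation error is proportional to h.
    return (cube(x0 + h) - cube(x0)) / h


exact = 12.0                      # d/dx x^3 at x = 2
ad_error = abs(derivative_ad(2.0) - exact)
fd_error = abs(derivative_fd(2.0) - exact)
```

For this function the dual-number derivative is exact, while the forward difference carries an error of roughly 6h. Shrinking h reduces the truncation error but increases cancellation error, which is why differencing cannot reach full machine precision.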

The automatic generation of the Jacobian matrix and sparsity pattern in the NEOS 
version of PATH makes the code more accessible and useful than requiring the hand-coding 
of the Jacobian matrix. Indeed, all nonlinear solvers in NEOS use automatic differentiation 
tools to compute gradients, Jacobians, and sparsity patterns. We intend to incorporate 
ADIC [6] into most of the nonlinear NEOS solvers to allow problems to be specified in C, 
as well as Fortran. 

The final section of the paper describes how the Condor system at the University of
Wisconsin is used to process the submitted jobs. Only large jobs are scheduled on Condor 
because there may be a delay in execution while waiting for an idle workstation. Small 
jobs are executed immediately on non-clustered workstations. No computational results are
given here. Users are encouraged either to submit one of the supplied sample problems or 
to generate new mixed complementarity problems to test the system. 

2 The NEOS Server 

The NEOS Server provides Internet access to a library of optimization solvers with user
interfaces that abstract the user from the details of the optimization software. The user needs
only to describe the optimization problem in a particular format; all additional information 
required by the optimization solver is determined automatically. This abstraction is similar 
to that provided by modeling languages. The NEOS solvers provide several different input 
formats to allow users to specify optimization problems in a convenient manner, without 
necessarily rewriting their problem in a modeling language. 

The NEOS approach offers considerable advantages over a conventional environment for 
solving optimization problems. Consider, for example, mixed complementarity problems. A 
NEOS solver for mixed complementarity problems requires that the user specify the number
of variables n, a subroutine initpt(n,x) that defines the starting point, a subroutine
xbound(n,xl,xu) that sets the lower and upper bounds, and a subroutine fcn(n,x,f)
that evaluates the function F. Since there is no need to provide the Jacobian matrix or
the sparsity pattern of the Jacobian matrix, the user can concentrate on the specification
of the problem. Changes to the fcn subroutine can be made and tested immediately; the
advantages in terms of ease of use are considerable. 

Other optimization problems can be specified in a similar manner. For example, the 
nonlinearly constrained optimization problem 

$$\min \{\, f(x) : x_l \le x \le x_u,\ c_l \le c(x) \le c_u \,\}$$

can be specified by four subroutines. The bounds $x_l$ and $x_u$ are specified with the
subroutine xbound(n,xl,xu), the constraint bounds $c_l$ and $c_u$ are specified with the
subroutine cbound(m,cl,cu), the objective function $f : \mathbb{R}^n \to \mathbb{R}$ is defined by the subroutine
fcn(n,x,f), and the nonlinear function $c : \mathbb{R}^n \to \mathbb{R}^m$ is defined by cfcn(m,x,c).

We have mentioned nonlinear optimization solvers, but NEOS contains solvers in other
areas. A complete listing is available at the NEOS Server homepage: 

http://www.mcs.anl.gov/otc/Server/

The addition of solvers is not difficult. Indeed, as discussed in [8], NEOS was designed so
that solvers in a wide variety of optimization areas can be added easily. 

We provide Internet users the choice of three interfaces for submitting problems: e-mail, 
the NEOS Submission Tool, and the NEOS Web interface. These interfaces are designed 
so that problem submission is intuitive and requires the minimal amount of information. 




Figure 2.1: The NEOS submission form for PATH 



The interfaces differ only in the way that information is specified and passed to the NEOS
Server. 


The e-mail interface is relatively primitive, but useful because most users have easy 
access to e-mail. Information on the available solvers and on the format used to submit 
problems via e-mail can be obtained by sending the mail message help to 

neos@mcs.anl.gov

Users interested in the Web interface should visit the homepage for the NEOS Server, which 
has links to all the solvers in the library, as well as pointers to other NEOS information, 
in particular, the NEOS Guide. In the remainder of this section we examine the NEOS
Submission Tool. 

The NEOS Submission Tool provides a high-speed link to the NEOS Server via TCP/IP
sockets. Once this tool is installed (only Perl [17] is required), the user has access to all 
solvers offered by NEOS. Additional information on the NEOS Submission Tool, including 
installation instructions, can be obtained from the NEOS Server homepage. 

Submission of problems via the NEOS Submission Tool is simple. The user must first
choose the type of optimization problem and then select the desired solver. Once the solver 
is selected, the user is given a submission form specific to the solver. 

The PATH submission form, shown in Figure 2.1, requires that the user specify the
number of variables, the files for the initial point, bounds on the variables, and function 
evaluation subroutines. 

Figure 2.1 shows the NEOS Submission form for a model of oligopolistic pricing [16] 
with 63 variables. The model is defined by the file op_fcn, while the initial point and the
bounds on the problem are defined by the files op_initpt and op_xbound. Note that in
this case two PATH options are set and that we have requested the use of Condor for the 
solution of this problem. 

The user has the option of using a Condor pool of workstations for solving the submitted 
problem. Since Condor is essentially a batch processing mechanism, the user is also allowed
to specify a timeout for Condor. After this period of time, the Web browser or Submission 
Tool is released from its busy state and returned to the user. The job, however, continues 
to process, and the results are returned to the user at a later date via e-mail. The default 
timeout is 5 minutes; Condor is used by default on all problems that are larger than 500 
variables. 

The PATH submission form allows the user to provide a specification file that can be 
used to set tolerances and other parameters that govern the algorithm. For most problems 
the defaults provided are adequate. Figure 2.1 shows two options in use for this submission. 
The first provides a listing of the current settings of all the available options for this run; 
the second just turns off the default crash technique. The form also has room for comments, 
which can be used to identify the problem submission.

Once specified, the problem is submitted to NEOS where it is then scheduled for execution.
A variety of computers, even a massively parallel processor, could be used to solve the
problem. At present these computers are workstations that reside at Argonne National Lab- 
oratory, Northwestern University, the University of Wisconsin, Lawrence Berkeley National 
Laboratory, the Technical University of Ilmenau in Germany, and Arizona State University. 

3 Solving Complementarity Problems: PATH 

The process used to solve a nonlinear complementarity problem by NEOS is illustrated 
in Figure 3.1. This process includes the generation of derivative and sparsity information 
with the ADIFOR [5, 4] automatic differentiation tool and the solution of the problem with 
the Condor [15, 11] distributed resource management system. We discuss ADIFOR further 
in Section 4, while Condor is discussed in Section 5. In this section we discuss issues in 
the solution process that are pertinent to the development of optimization software and 
problem-solving environments. Although the discussion is specific to PATH [10, 12], most 
of the issues are applicable to all the solvers of nonlinear optimization problems in NEOS.

Submitting a problem to the NEOS Server does not guarantee success, but NEOS users 
are able to solve difficult optimization problems without worrying about many of the details 
that are typical in a conventional computing environment. Even if the user has suitable op- 
timization software, the user would need to read the documentation, write code to interface 
his problem with the optimization software, and then debug this code. The user would also 
have to write code to evaluate the Jacobian matrix and sparsity pattern, and debug that
code, a nontrivial undertaking in most cases.

Figure 3.1: PATH and Condor

For a typical submission, the user receives information on the progress and the solution.
Figure 3.2 shows part of the output received when the problem in Figure 2.1 is submitted 
to NEOS via the NEOS Submission Tool. In particular, we see the NEOS server selecting 
an available workstation, transferring all user data to the workstation, and then invoking 
the solver remotely. The solver (in this case PATH) checks the data and compiles the user's
code. If any errors are found at this stage, the compiler error messages are returned to the
user, and execution terminates.

If the user's code compiles correctly, the automatic differentiation tool ADIFOR [5, 4]
is used to generate the Jacobian matrix and the sparsity pattern. Additional details on
this part of the process are discussed in Section 4. Once the Jacobian matrix and sparsity 
pattern are obtained, the user's code is linked with the optimization libraries, and execution 
begins. Results are returned in the window generated by the NEOS Submission Tool. 

The solver script that handles the solution process must check the input data to make 
sure that the job submission is valid. A typical error at this stage of the solution process is 
for the user to interchange files and to send, for example, subroutine initpt where NEOS is 
expecting subroutine xbound. A similar error is to neglect to send one of the files required 
for the job submission. These errors are detected by the solver script by checking that the 
files that specify fcn, initpt, and xbound exist and that they reference the appropriate
subroutine. The solver script also checks that the problem dimensions are positive. 
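
The checks performed by the solver script can be sketched as follows. This is an illustrative Python version, not the NEOS solver script itself: the function name `check_submission`, the representation of the submitted files as a dictionary of source strings, and the error messages are all assumptions.

```python
import re
from typing import Dict, List

REQUIRED = ("fcn", "initpt", "xbound")


def check_submission(n: int, files: Dict[str, str]) -> List[str]:
    """Return a list of error messages; an empty list means the submission looks valid.

    `files` maps each expected subroutine name to the Fortran source text
    submitted for it (a missing key models a missing file).
    """
    errors = []
    if n <= 0:
        errors.append("number of variables must be positive")
    for name in REQUIRED:
        source = files.get(name)
        if source is None:
            errors.append(f"missing file for subroutine {name}")
            continue
        # Fortran is case-insensitive; look for 'subroutine <name>('
        pattern = rf"^\s*subroutine\s+{name}\s*\("
        if not re.search(pattern, source, re.IGNORECASE | re.MULTILINE):
            errors.append(f"file for {name} does not define subroutine {name}")
    return errors


good = {
    "fcn": "      subroutine fcn(n,x,f)\n      end\n",
    "initpt": "      subroutine initpt(n,x)\n      end\n",
    "xbound": "      subroutine xbound(n,xl,xu)\n      end\n",
}
# The user interchanged the initpt and xbound files.
swapped = {"fcn": good["fcn"], "initpt": good["xbound"], "xbound": good["initpt"]}

ok_errors = check_submission(63, good)
swap_errors = check_submission(63, swapped)
```

Interchanged files are caught because each file is checked against the subroutine name that NEOS expects in that slot, not merely for existence.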

Even if the supplied subroutines compile correctly, ADIFOR may find an error during
the generation of the fcn subroutine. The most common error here is for the submission to
contain an improper calling sequence. For example, if the calling sequence of the submitted
fcn is fcn(n,x,y), ADIFOR generates an error because it is assuming that the dependent
variables are f. On the other hand, ADIFOR does not generate an error if the calling
sequence is fcn(n,x,f,w) because now there is a dependence on f. All error messages
generated by ADIFOR are sent back to the user.

Figure 3.2: Output from the NEOS Submission tool

If the derivative and sparsity information is generated, this information is sent to PATH. 
Errors may also occur during this part of the solution process, and it is again important 
to send appropriate messages back to the user. At present, we check only that the user 
function does not create any system exceptions during the evaluation of the function at the 
starting point or at any of the iterates. Although simple, this test catches many user errors. 
In particular, this test does not allow a calling sequence of the form fcn(n,x,f,w).

Additional checks on the function would be desirable, but seem to be difficult to imple- 
ment. For example, we would like to check that the function provided is indeed differen- 
tiable. If the user provides a function that is discontinuous, automatic differentiation tools 
will generate the Jacobian matrix but will not be able to detect this situation. 

4 ADIFOR 

Figure 3.1 shows that given the function F that defines the mixed complementarity problem, 
the automatic differentiation tool ADIFOR/SparsLinc [3, 5] is used to produce the Jacobian
matrix of F and the sparsity structure of the Jacobian matrix. This information is then 
fed to PATH. In this section we describe the process for generating a representation of the 
sparse Jacobian matrix of the function F that is submitted to the PATH solver. This process 
is of interest to any researcher who wishes to use automatic differentiation tools. 

The first step in using ADIFOR is to create a script file that defines the dependent and 
independent variables, the name of the subroutine that needs to be differentiated, and a 
composition file. 
AD_PROG = fcn.comp
AD_TOP = fcn
AD_IVARS = x
AD_DVARS = f
AD_SEP = _
AD_OUTPUT_DIR = .
AD_FLAVOR = sparse

Figure 4.1: Script file fcn.script for ADIFOR

Figure 4.1 shows the script file that is used with PATH. This file tells ADIFOR that the
subroutine that needs to be differentiated is called fcn, that the independent variables are
x, and that the dependent variables are f. The composition file is specified with AD_PROG,
so in this case the composition file is fcn.comp.

The composition file contains a list of all the files that are required to compute the 
function, and also a sample program that specifies the calling sequence for the subroutine 
fcn. The composition file used with PATH is shown below:

fcn.f
fcn_sample.f

This file tells ADIFOR that file fcn.f contains the subroutine fcn and that the sample
program is contained in file fcn_sample.f.

All subroutines that are required to evaluate the function must be in the file fcn.f.
If other subroutines are needed (for example, some BLAS subroutines), they also must be
included in fcn.f.

The sample file fcn_sample.f is not strictly needed because we already know that the
subroutine to be differentiated is called fcn. However, in older versions of ADIFOR, this file
is needed. Figure 4.2 shows the sample file that is used with PATH. The only information
specified by this file is the calling sequence used by fcn.

      program fcn_sample
      integer n
      double precision x(n), f(n)
      call fcn(n,x,f)
      end

Figure 4.2: Sample file fcn_sample.f for ADIFOR

Given the information in the script and composition files, ADIFOR can be used to 
generate a subroutine that computes the Jacobian matrix. Since we are interested in
computing sparse Jacobian matrices, the subroutine that ADIFOR generates uses special data
structures (called objects) to define the Jacobian matrix.
The command Adifor2.0 AD_SCRIPT=fcn.script instructs ADIFOR to generate a
subroutine of the form

g_fcn(n,x,g_x,f,g_f)

where g_x is a gradient object for the independent variable x and g_f is a gradient object for
the function F. These objects are manipulated and accessed by PATH as described below.
Further details on how to invoke ADIFOR can be found in [3, 5].

We compute the Jacobian of F by manipulating the gradient object with subroutines 
provided by ADIFOR in the SparsLinc [4] library. We first set the gradient object g_x for
the independent variable x to the identity matrix with the code segment

do j = 1, n 

.call dspsd{g^x(j),j,i.dO,l) 
end do 

Once g_x is defined, we use the ADIFOR-supplied subroutine dspxsq to compute the 
Jacobian matrix. The call 

call dspxsq(ind_col,val,n,g_f(i),lenrow,info) 

extracts the ith row of the Jacobian matrix. On exit from this call to dspxsq, the array 
ind_col contains the column indices of the ith row of the Jacobian matrix, the array val 
contains the values of the ith row, and the variable lenrow is the number of nonzeros. 
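
The seeding and row-extraction steps can be illustrated outside Fortran. The following is a minimal Python sketch, not the SparsLinC implementation: the class SparseGrad and the helpers seed_identity, extract_row, and example_fcn are hypothetical names introduced here, and indices are 0-based rather than 1-based. It only shows why seeding the gradient of x with the identity makes the gradient of each F_i equal to row i of the Jacobian.

```python
class SparseGrad:
    """A value together with a sparse gradient {column index: partial derivative}."""

    def __init__(self, value, grad=None):
        self.value = value
        self.grad = dict(grad or {})

    def __add__(self, other):
        # d(u + v) = du + dv
        grad = dict(self.grad)
        for k, g in other.grad.items():
            grad[k] = grad.get(k, 0.0) + g
        return SparseGrad(self.value + other.value, grad)

    def __mul__(self, other):
        # d(u * v) = v du + u dv
        grad = {k: other.value * g for k, g in self.grad.items()}
        for k, g in other.grad.items():
            grad[k] = grad.get(k, 0.0) + self.value * g
        return SparseGrad(self.value * other.value, grad)


def seed_identity(x):
    """Analogue of the dspsd loop: the gradient of x[j] is the j-th unit vector."""
    return [SparseGrad(v, {j: 1.0}) for j, v in enumerate(x)]


def extract_row(g):
    """Analogue of dspxsq: return (ind_col, val, lenrow) for one Jacobian row."""
    ind_col = sorted(g.grad)
    val = [g.grad[j] for j in ind_col]
    return ind_col, val, len(ind_col)


def example_fcn(x):
    """Hypothetical test function F(x) = (x0*x1, x1 + x2, x2*x2)."""
    return [x[0] * x[1], x[1] + x[2], x[2] * x[2]]


f = example_fcn(seed_identity([2.0, 3.0, 5.0]))
rows = [extract_row(fi) for fi in f]
```

Each entry of rows holds the column indices, values, and nonzero count of one row, which is the information the text says dspxsq returns.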

Two key difficulties arise when using the derivative information provided by ADIFOR in 
PATH. The first difficulty is that PATH, like most optimization software for sparse problems, 
assumes that the sparsity structure is known for all values of the independent variables x. 
This information is needed in order to preallocate enough storage for the Jacobian matrix 
and to minimize the cost of preprocessing the Jacobian matrix. For example, orderings 
that reduce the fill in an elimination algorithm use the sparsity structure, so if the sparsity 
structure changes at each iteration, then the ordering will have to be recomputed at each 
iteration. Dynamic storage allocation schemes could be used, but these schemes tend to 
increase the overall computing time significantly. 

We determine a sparsity structure that is valid for all values of the independent variables 
x by evaluating the Jacobian matrix at a random perturbation of the initial point provided 
by the user. We cannot use the initial point provided by the user to determine a sparsity 
structure that is valid for all values of the independent variables x because the initial 
starting point tends to be special (for example, the vector of all zeros or all ones), and 
thus the resulting sparsity structure is not representative. This heuristic was also used by 
Bouaricha and Moré [7] in a similar situation. 
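
A small example shows why the perturbation matters. The sketch below is an illustration under stated assumptions, not the code used with PATH: perturb, nonzero_pattern, and example_jacobian_rows are hypothetical names, the Jacobian is that of the made-up function F(x) = (x0*x1, x1^2), and the perturbation size is an arbitrary choice.

```python
import random


def perturb(x0, rng, scale=1.0):
    """Shift every component by a random amount of magnitude in [0.5, 1] * scale."""
    return [xi + scale * rng.choice((-1.0, 1.0)) * rng.uniform(0.5, 1.0) for xi in x0]


def nonzero_pattern(jacobian_rows, x):
    """Collect the (row, column) positions whose Jacobian value at x is nonzero."""
    pattern = set()
    for i, (ind_col, val) in enumerate(jacobian_rows(x)):
        for j, v in zip(ind_col, val):
            if v != 0.0:
                pattern.add((i, j))
    return pattern


def example_jacobian_rows(x):
    """Rows of the Jacobian of F(x) = (x0*x1, x1**2) as (ind_col, val) pairs."""
    return [([0, 1], [x[1], x[0]]), ([1], [2.0 * x[1]])]


rng = random.Random(0)
x0 = [0.0, 0.0]                       # a "special" starting point
at_start = nonzero_pattern(example_jacobian_rows, x0)
at_perturbed = nonzero_pattern(example_jacobian_rows, perturb(x0, rng))
```

At the all-zero starting point every entry vanishes and the detected pattern is empty, whereas the perturbed point exposes all three structural nonzeros.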

If the sparsity pattern changes as the iteration proceeds, then the heuristic that we are 
using may fail. However, this situation seems to be rare. Heuristics that detect changes in 
sparsity patterns are the subject of current research. 

The second difficulty is that PATH uses a pivotal method to compute the step between 
iterates, and this method requires that the Jacobian matrix be stored by columns. On the 
other hand, ADIFOR computes the Jacobian matrix by rows. Storing the Jacobian matrix 
in a compressed column format specified by an array ind_row of row indices and an array 
col_ptr that points to the start of each column is not difficult. We use an additional array 
col_start that initially agrees with col_ptr. As we run through the rows of the Jacobian 
matrix generated by using dspxsq, the entries from val are immediately put into their 
correct location (as determined by ind_row), and the corresponding entry of col_start is 
incremented. Note that the resulting column-wise storage is sorted by row indices, even 
though this is not required by PATH. A final check to determine whether the allocated 
storage is sufficient is carried out after all rows have been processed. 
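
The row-to-column conversion can be sketched as follows. This is an illustrative Python version with 0-based indices, not the interface code itself; the function names column_pointers and rows_to_columns are hypothetical, and the sketch fills ind_row while scattering the values, which is one possible reading of the description above.

```python
def column_pointers(rows, n_cols):
    """Build col_ptr from a known sparsity structure given row by row."""
    counts = [0] * n_cols
    for ind_col, _ in rows:
        for j in ind_col:
            counts[j] += 1
    col_ptr = [0] * (n_cols + 1)
    for j in range(n_cols):
        col_ptr[j + 1] = col_ptr[j] + counts[j]
    return col_ptr


def rows_to_columns(rows, col_ptr):
    """Scatter row-wise (ind_col, val) data into compressed column storage."""
    nnz = col_ptr[-1]
    col_start = list(col_ptr[:-1])      # initially agrees with col_ptr
    ind_row = [-1] * nnz
    values = [0.0] * nnz
    sufficient = True
    for i, (ind_col, val) in enumerate(rows):
        for j, v in zip(ind_col, val):
            pos = col_start[j]
            if pos < col_ptr[j + 1]:
                ind_row[pos] = i
                values[pos] = v
            else:
                sufficient = False      # more entries than were preallocated
            col_start[j] += 1
    return ind_row, values, sufficient


rows = [([0, 1], [3.0, 2.0]), ([1, 2], [1.0, 1.0]), ([2], [10.0])]
col_ptr = column_pointers(rows, 3)
ind_row, values, sufficient = rows_to_columns(rows, col_ptr)
```

Because rows are visited in increasing order, each column ends up sorted by row index without an explicit sort, and the sufficient flag plays the role of the final storage check.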

5 Condor 

Condor [11, 15] is a distributed resource management system, developed at the University 
of Wisconsin, that manages large heterogeneous clusters of workstations. Due to the 
ever decreasing cost of low-end workstations, such resources are becoming prevalent in 
many workplaces. The Condor design was motivated by the needs of users who would like 
to take advantage of the underutilized capacity of these clusters for their long-running, 
computationally intensive jobs. Condor has been ported to most UNIX platforms and has 
been used in production mode for more than eight years in the Computer Sciences Department 
of the University of Wisconsin and many other sites. A version that runs under 
Windows NT is under development. The system is publicly available under the GNU copyleft 
restrictions and can be downloaded from 

http://www.cs.wisc.edu/condor/ 

In order to generate vast amounts of computational resources, such a system must use 
any kind of resource whenever it is made available. Condor acts like a matchmaker, pairing 
these computational resources with jobs that require processing. The job executes on the 
allocated machine until it completes or the resource disappears. In the latter case, the job 
is checkpointed, the machine is returned to the owner, and Condor is contacted once again 
for pairing. Checkpointing a job is the process of saving the current state of the job in a 
way that allows restarting from precisely the same point of execution. 
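
The idea of checkpointing can be illustrated with a toy example. The sketch below is an application-level analogue only: Condor checkpoints the state of the whole process transparently, whereas this hypothetical run_with_checkpoints function saves a small state dictionary to a JSON file chosen by the caller. The stop_after argument is an assumption used to simulate the resource disappearing.

```python
import json
import os


def run_with_checkpoints(total_steps, path, stop_after=None):
    """Sum 0..total_steps-1, saving state after every step so work can resume."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as fh:
            state = json.load(fh)       # restart from the saved point
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            return None                 # simulated eviction; state is on disk
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        with open(path, "w") as fh:
            json.dump(state, fh)
    return state["acc"]
```

Calling the function once with stop_after set and then again without it completes the computation from the saved step rather than from the beginning.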

Condor preserves a large measure of the originating machine's environment on the 
execution machine, even if the originating and execution machines do not share a common file 
system. Condor jobs that consist of a single process (it is possible to run PVM on a Condor 
cluster) are automatically checkpointed and migrated between workstations as needed to 
ensure eventual completion. Condor is flexible and fault-tolerant: the design features ensure 
the eventual completion of the job. This feature is important for our application. 

A key design feature of the Condor system is that the owner of the resource should have 
as little interference from the resource allocation server as possible; in this way, more owners 
will make their resources available to the pool. Condor pays special attention to the needs 
of the interactive user of the workstation by allowing the user to define the conditions under 
which the workstation can be allocated by Condor to a batch user. As a consequence, there 
are currently over 400 workstations in the Wisconsin pool. 

The use of Condor for solving complementarity problems generated from NEOS is in- 
tended to be an example, showing how a wide variety of software tools can be interfaced 
and used in a practical operations research environment. The development of software 
tools is extremely important, but frequently there are few examples demonstrating how the 
developer envisioned these tools being used. Such examples serve as a prototype for new 
applications and show potential users how to develop their applications and problems for 
network solution. 

Figure 3.1 shows all the steps used by NEOS to solve a mixed nonlinear complementarity 
problem. We have already discussed the generation of derivative and sparsity patterns 
in Section 4. We now outline how Condor schedules job submissions on an appropriate 
workstation from the Wisconsin pool. 

Using Condor-managed resources is easy. Fortran or C code that runs under one of the 
supported systems can be relinked by using libraries from the Condor system without any 
changes to the source code. The solver script schedules only large jobs for this facility, since 
there may be a delay in execution while waiting for an appropriate idle workstation. Small 
jobs are executed directly on a non-clustered machine at Wisconsin. 

The first step in preparing a code for solution using Condor is to ensure that the code 
compiles and runs on the native machine of that class. Since the PATH solver is already 
tested and available in library form, this amounts to checking that the submitted routines 
and ADIFOR-generated routines can be compiled, exactly as is done for a submission that 
is to run on a local machine. The second step links the objects to the PATH objects, 
replacing some of the standard libraries with Condor-supplied replacements. Our interface 
replaces a single line in the standalone makefile, namely, 

f77 -o pathsol $(OBJECTS) $(LIBS) 

with the following line 

condor_compile f77 -o pathsol $(OBJECTS) $(LIBS) 

Both of these makefiles use precisely the same library of routines that implement PATH as 
described in [12]. The only difference is that different system libraries are linked into the 
executable in order to facilitate the checkpointing that was mentioned above. 

Instead of executing pathsol on the machine where it was compiled, the solver script 
generates a job description file that details the location of the executable, the requirements 
of the job, and all input and output files. For NEOS, the job description file jdf is 

Executable = pathsol 
Log = condor_dir/condor.log 
Coresize = 0 
Notification = never 
Queue 

This file specifies, in particular, that condor.log, in directory condor_dir, is the Condor 
log file. The purpose of the log file is discussed below. 

The job is submitted to Condor by using the command condor_submit jdf. At 
this stage, Condor takes over control of the job. For security purposes, the job runs as user 
nobody, thereby limiting the access of the submitted job to files that it owns. 

In the remainder of this section, we outline how we allow NEOS to time out from the 
Condor job and how we guarantee that the job results are returned to the NEOS user, 
even if the machine that submitted the job to Condor dies during the execution of the job 
or communication between NEOS and Condor dies. Note that the Condor job is detached 
from the submitting machine and is guaranteed to continue to execute to completion. 

We first create a persistent directory, condor_dir, on the machine that submits the 
Condor job. All files related to Condor jobs are located in condor_dir. When a job is 
submitted, we create a symbolic link in condor_dir to the job directory created by the 
NEOS Communications Package, the facility enabling communication between a solver 
and the NEOS Server. This job directory serves as the repository for both incoming job 
submissions and outgoing results. 

The Condor log file, condor.log, resides in condor_dir and shows where each job from 
NEOS was submitted and executed. We have created a program for monitoring this log 
with the UserLog API that is part of the Condor distribution. The monitoring program, 
watchlog, is regularly invoked by the system utility cron and immediately exits if it finds 
another version of watchlog running. 

The purpose of watchlog is to ensure that the results of any job submitted to Condor 
are written to the appropriate job directory and to signal that the Condor job is finished. 
Condor writes to condor.log every time the status of the job changes. The watchlog 
monitoring program acts upon two events that are written into condor.log, namely, 
EXECUTE and TERMINATE. If the job in question starts executing on a machine, watchlog 
adds the IP address of this machine to the list of machines that have been used for this job. 
At the end of processing, this list is appended to the job results; an example of such a list 
is given below: 

<128.105.40.8:32776> 
<128.105.41.104:32839> 
<128.105.76.12:36651> 
<128.105.5.11:56558> 

The second event that triggers the watchlog program is the fact that the job in question 
is terminating. In this case, watchlog writes the status of the Condor job to a file called 
CONDOR_DONE in the job directory. Once the job completes, the results are returned to the 
NEOS user by the mechanism we now describe. 
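
The handling of the two events can be sketched as follows. This is not the watchlog source and does not use the Condor UserLog API: the event tuples (kind, job id, detail) are a hypothetical stand-in for whatever the log reader returns, and process_events is a name introduced here for illustration.

```python
import os


def process_events(events, job_dirs, machines_used):
    """React to EXECUTE and TERMINATE events for jobs with known job directories."""
    for kind, job_id, detail in events:
        if kind == "EXECUTE":
            # detail is the address of the machine that started running the job
            machines_used.setdefault(job_id, []).append(detail)
        elif kind == "TERMINATE":
            # detail is the job status; its presence on disk signals completion
            path = os.path.join(job_dirs[job_id], "CONDOR_DONE")
            with open(path, "w") as fh:
                fh.write(detail + "\n")
    return machines_used
```

Events of any other kind are ignored, which mirrors the statement that only two events trigger an action.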

First note that even if a job was executed under Condor, the solver creates exactly the 
same solution files and writes these files in the job directory as before. To guarantee that the 
results are returned to the NEOS user, we have to deal with two cases. First, the submitting 
machine may die while the job is being executed under Condor; second, the NEOS 
user may not be willing to wait more than 5 minutes for the job results to be returned. To 
deal with both these cases, we have created another cron program, job-checker, to ensure 
that the job results are returned either to NEOS or directly to the user via e-mail. 

This program simply monitors the job directory, checking for the existence of the files 
CONDOR and CONDOR_DONE (signifying that Condor was used and that Condor had completed 
the job) and that the file DONE does not exist. The solver script creates the file DONE once 
the user has been notified of the job results. Thus if this file is not present, and the job was 
a Condor job that had timed out, we return the job results to the user via e-mail. There 
is a slight race condition here: it is possible that a (Condor) timeout can occur while the 
job is being returned to NEOS. In this case, the user may get notified of the results both 
in the NEOS submission tool or Web browser and via e-mail. We believe this strategy is 
preferable to the possibility of losing some results. 
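
The test performed on the job directory reduces to checking three marker files. A minimal sketch, assuming the file names given in the text and a hypothetical function name needs_email:

```python
import os


def needs_email(job_dir):
    """True when Condor was used and has finished but the user was never notified."""
    def exists(name):
        return os.path.exists(os.path.join(job_dir, name))

    return exists("CONDOR") and exists("CONDOR_DONE") and not exists("DONE")
```

The check is deliberately not atomic with respect to the creation of DONE, which is the source of the race condition mentioned above.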
Abstract 

Application development for high-performance 
distributed computing systems, or computational grids as 
they are sometimes called, requires "grid-enabled" tools 
that hide mundane aspects of the heterogeneous grid 
environment without compromising performance. As 
part of an investigation of these issues, we have 
developed MPICH-G, a grid-enabled implementation of the 
Message Passing Interface (MPI) that allows a user to 
run MPI programs across multiple computers at 
different sites using the same commands that would be 
used on a parallel computer. This library extends the 
Argonne MPICH implementation of MPI to use 
services provided by the Globus grid toolkit. In this paper, 
we describe the MPICH-G implementation and present 
preliminary performance results. 



1 Introduction 

High-performance "computational grids" [11] 
involve heterogeneous collections of computers that may 
reside in different administrative domains, run different 
software, be subject to different access control policies, 
and be connected by networks with widely varying 
performance characteristics. We believe that application 
development in these environments requires specialized 
"grid-enabled" tools that hide mundane aspects of the 
heterogeneous grid environment without compromising 
performance. These tools may implement familiar 
programming models, such as message passing, data 
parallelism, or object parallelism (perhaps with extensions), 
or may implement completely new programming 
models. In either case, research is required to understand 
the utility of different approaches and the techniques 
that may be used to implement these approaches in 
different environments. 

As part of an investigation of these issues, we have 
developed MPICH-G, a grid-enabled implementation 
of the Message Passing Interface (MPI) that allows the 
user to run MPI programs across multiple computers 
at different sites using the same commands that would 
be used on a parallel computer. This library extends 
the Argonne MPICH implementation of MPI [15] to 
use services provided by the Globus grid toolkit [10], 
as follows: 

1. The Globus information service is used to deter- 
mine how to obtain access to the computers in 
question. 

2. The Globus security service is used to handle au- 
thentication and authorization at each site. 

3. The Globus executable management service is 
used to stage executables. 

4. The Globus resource management service is used 
to start processes on each computer, interfacing 
with local schedulers where necessary. 

5. The Globus communication service is used to man- 
age the different communication methods that 
may apply in a heterogeneous environment, such 
as vendor-supplied protocols or TCP/IP. 

6. The Globus file access service is used to direct 
standard output and error (stdout and stderr) 
streams to the user's terminal and to provide ac- 
cess to files regardless of location. 

7. Globus process management facilities allow the 
programmer to monitor the progress of an appli- 
cation and terminate it if desired. 



MPICH-G is a complete implementation of the MPI-1 
standard and passes the MPICH test suite. Early 
experiences suggest that it achieves our goal of reducing 
barriers to the use of distributed computing by 
allowing the use of MPI as a portable, high-performance 
programming model for heterogeneous clusters and for 
wide-area computing systems. Several groups (e.g., at 
Lawrence Livermore National Laboratory (LLNL) and 
NASA Ames Research Center) are using it to run 
conventional MPI programs across multiple massively 
parallel processors (MPPs) within the same machine room. 
In this case, MPICH-G is used primarily to manage 
startup and to achieve efficient communication via use 
of different low-level communication methods. Other 
groups are using MPICH-G for metacomputing 
experiments, in which applications are distributed across 
MPPs located at different sites: Larsson for studies of 
distributed execution of a large computational 
electromagnetics code [17], and Chen and Taylor in studies of 
automatic partitioning techniques as applied to finite 
element codes [4]. MPICH-G can also be used to 
implement distributed visualization pipelines and similar 
applications in which components are located at 
different sites. In these latter examples, MPICH-G is used 
to manage heterogeneous authentication and startup 
mechanisms. 

In the rest of this article, we describe the problems 
that we faced in developing MPICH-G, the techniques 
used to overcome these problems, and preliminary ex- 
perimental results that indicate the costs associated 
with the MPICH-G implementation. 

2 The Need for Grid-Enabled Tools 

An extensive body of experience shows that the 
coupling of geographically distributed computers, 
databases, scientific instruments, and people can en- 
able interesting new applications. Distributed super- 
computing [19], knowledge synthesis [20], online in- 
strument control [16], and teleimmersion [6] are just 
four examples. However, experience also shows that 
the barriers to the construction of such applications 
are considerable. Few programmers take the time to 
master the intricacies of such grid environments, and 
even those who do often produce applications that are 
fragile, nonportable, and slow. 

The specific problems encountered by the developers 
of such grid applications vary widely according to 
the grid environment and application type in question. 
We use Figure 1 to illustrate some of the problems that 
we have been concerned with in the development of 
MPICH-G. This figure shows three massively parallel 
processing (MPP) systems, each constructed from 
symmetric multiprocessor (SMP) nodes. Two of the MPPs 
are located within the same institution and hence are 
connected by some form of (hopefully high-speed) local 
area network (LAN), while the third is located at a 
remote site and hence is reached by a wide area network 
(WAN). The following is a partial list of the problems 
that we may encounter in such an environment: 

1. The two sites will likely operate different authenti- 
cation and authorization mechanisms and impose 
different access control policies. A user is unlikely 
to have the same user id at the two sites. 

2. The two sites are unlikely to share a file system. 
Hence, specialized techniques are required 
to transfer executables and program files between 
sites. 

3. The different MPPs may be controlled by different 
schedulers with different scheduling policies. 

4. We need to allocate resources concurrently at 
multiple sites and establish a single computational 
environment (in MPI terms, a single 
MPI_COMM_WORLD) that spans those resources. (We 
refer to this as the "co-allocation" problem.) 

5. Efficient communication requires that different 
communication methods be used in different 
situations. Within an SMP, shared-memory 
communication should be used, whether by using 
explicit shared-memory operations or by using 
shared-memory operations to provide fast 
implementations of other abstractions such as message 
passing. Between SMPs within the same MPP, a 
vendor-supplied message-passing library should be 
used. Only between MPPs should the universally 
available but slow TCP/IP be used. (An exception 
to this rule is shown in the upper MPP in Figure 1. 
In some cases, a limitation on the number of nodes 
that can communicate using the vendor-supplied 
library may require the use of TCP/IP even within 
an MPP.) 

6. The topology of the overall computational system 
needs to be taken into account when implementing 
communication algorithms. Taking into account 
the different TCP/IP performance (in terms 
of both absolute speeds and bisection bandwidths) 
within an MPP, over a LAN, and over a WAN, the 
example system features five different 
communication speed regimes. 
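
The selection rule in item 5 can be written as a small decision function. The sketch below is illustrative only and is not how MPICH-G or Nexus chooses a method: the Location tuple, the method labels, and the vendor_ok flag (which stands in for the node-count limitation mentioned in the exception) are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical description of where a process runs.
Location = namedtuple("Location", "site mpp node")


def select_method(a, b, vendor_ok=True):
    """Pick a communication method for processes at locations a and b."""
    if (a.site, a.mpp) != (b.site, b.mpp):
        return "tcp"                    # different MPPs: only TCP/IP is available
    if a.node == b.node:
        return "shared-memory"          # same SMP node
    # Same MPP, different SMP nodes: prefer the vendor library when it can be used.
    return "vendor-mpi" if vendor_ok else "tcp"


p = Location("anl", "sp", 0)
q = Location("anl", "sp", 1)
r = Location("ncsa", "origin", 0)
methods = (select_method(p, p), select_method(p, q), select_method(p, r))
```

A topology-aware collective operation, as called for in item 6, would consult a function of this kind to group processes by the cost of communicating between them.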

We believe that the solution to these types of 
problem is to develop grid-enabled tools that provide 
efficient implementations of familiar (or unfamiliar) 
programming models for use by application developers. In 
developing these implementations, the tool developer 
must be concerned not only with translating the 
programming model to the grid environment, but also with 
revealing to the programmer those aspects of the grid 
environment that impact performance. For example, 
a grid-enabled MPI might handle automatically issues 
of authorization, startup, and process management, 
hence addressing the first four points listed above. It 
might also incorporate specialized techniques for 
point-to-point and collective communication in highly 
heterogeneous environments, hence addressing points 5 and 6. 
Finally, it might also extend the MPI model to provide 
programmers with access to resource location services, 
information about grid topology, group communication 
protocols, and quality-of-service management services, 
so as to enable new programming techniques 
appropriate for grid environments. 

Figure 1. The structure of a prototypical "computational grid" computing environment, of the type 
supported by MPICH-G. See text for details. 

In principle, such grid-enabled tools could be con- 
structed from scratch. However, the task is greatly 
simplified if the programmer has access to appropri- 
ate low-level services. As we explain below, we use the 
Globus toolkit as a source of such services in our work. 

The state of the art with respect to such tools is not 
very advanced. Systems such as Condor [18], NEOS [5], 
and NetSolve [3] all implement grid-based 
programming models of various sorts. Various implementations 
of message passing libraries provide some support for 
heterogeneous execution (e.g., p4 [2] and PVM [14]), 
but these systems do not support the flexible use of 
alternative low-level communication protocols, 
interfaces to different MPP schedulers, or the MPI 
standard. PVMPI [7] exploits a renaming capability 
provided by MPI's profiling interface to use PVM 
mechanisms to couple vendor-supplied MPIs on different 
MPPs. The resulting system supports heterogeneous 
execution of MPI programs but cannot deal with 
heterogeneous startup mechanisms or dynamic selection 
of communication methods. 

3 Building Blocks 

Our grid-enabled MPI implementation is con- 
structed from two existing software systems: MPICH 
and Globus. We describe these briefly here. 

3.1 MPICH 

MPICH [15] is the most widely used implementation 
of the MPI standard. Its architecture features a 
layered design, in which higher-level MPI communication 
constructs such as collective operations, communicators, 
and topologies are implemented in terms of basic 
communication operations provided by an "abstract 
device." Various such devices have been designed, 
enabling high-performance implementations of MPICH 
on a variety of platforms. We exploit this device 
architecture in our work, defining a "Globus communication 
device" that supports the use of multiple low-level 
communication methods in heterogeneous wide area 
environments. 

MPICH also defines a uniform startup mechanism 
for MPI programs. For example, the command 

mpirun -np 64 myprog 

starts the MPI program myprog as 64 processes, 
whether on a shared-memory multiprocessor (via 
fork), a set of workstations on a local area network 
(e.g., via rsh), or on an MPP (e.g., via POE commands 
on an IBM SP). Our MPICH-G implementation allows 
the same command syntax to be used even when 
starting programs across multiple MPPs of different 
architectures. We believe that it is a significant achievement 
that we can provide a similarly simple and uniform 
interface in much more complex grid environments. 

3.2 Globus 

Globus is a widely used toolkit for building wide 
area applications. The toolkit comprises a set of inter- 
related components, each providing services and asso- 
ciated APIs that address a distinct aspect of wide area 
computing [10]. Components developed to date are 

1. the Nexus communication library, providing sup- 
port for multimethod communication; 

2. resource management services, providing uniform 
interfaces to local schedulers and support for bro- 
kering and co-allocation (see below); 

3. security services, providing support for single 
sign-on, multiway security contexts, and interfaces to 
local security services; 

4. file access services, providing staging services and 
uniform interfaces to files, regardless of location; 

5. a Lightweight Directory Access Protocol 
(LDAP)-based information service, 
the Metacomputing Directory Service (MDS), 
providing uniform access to up-to-date information 
about Globus resource structure and state; 

6. a fault detection service, providing a notification 
service for faulty processes; and 

7. executable management services that support 
staging of executables to remote computers. 



Globus has distinct local service, global service, and 
client components. At Globus sites, a small set of 
servers provide (deliberately simple) local services such 
as authentication, resource allocation, and status 
monitoring. In particular, a Globus Resource Allocation 
Manager (GRAM) implements a uniform interface to 
local resources (computers, networks, etc.) for 
authentication and allocation. Additional global services, 
defined in terms of these local services, provide more 
sophisticated functionality, such as resource brokering, 
co-allocation of resources, and fault detection. Finally, 
client libraries allow application programs and tools to 
invoke local and global services. 

Globus toolkit components are designed to support 
the incremental development of grid-enabled tools and 
applications. In principle, the user should be able to 
take either an existing or new program and gradually 
make it more "grid-aware" by introducing additional 
services. Preliminary application experiences suggest 
that this incremental development methodology works 
well [10]. Various groups are using a similar 
methodology to apply Globus components in other tool projects 
(e.g., [1, 13]); however, MPICH-G is the most 
sophisticated such system constructed to date. 

4 The MPICH-G Library 

We briefly describe the techniques used to imple- 
ment some of the MPICH-G capabilities listed in the 
introduction. 

4.1 Startup: mpirun and the machines File 

MPICH provides a standard command for starting 
MPI programs, namely, mpirun. This command spec- 
ifies the number of processes that are to be created 
and can also provide flags relating to debugging and so 
forth. 

On a parallel computer such as the IBM SP, the 
MPICH implementation of mpirun simply generates 
an appropriate job submission command to whatever 
scheduler is used to obtain access to the MPP. On the 
other hand, in a network of workstations environment, 
a machines file is accessed to determine which 
machines the MPI program should be started on. For 
example, the following file indicates that one process 
should be started on each of donner and dalek, and 
two processes on pitcairn. 

donner 
dalek 
pitcairn 2 

Our only change to the MPICH startup model is 
that we generalize the contents of the machines file to 
include resource manager (GRAM) names. For 
example, the following file names three such resource 
managers, at three different sites: 

donner.mcs.anl.gov-fork 8 
bonny.isi.edu-fork 8 
modi4.ncsa.uiuc.edu-lsf 64 
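
Both forms of the machines file share the same shape: a name followed by an optional process count. A minimal parsing sketch, assuming whitespace-separated fields and a default count of one (parse_machines is a hypothetical name, and this is not the MPICH parser):

```python
def parse_machines(text):
    """Return a list of (name, process_count) pairs from machines-file text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                    # skip blank lines
        name = fields[0]
        count = int(fields[1]) if len(fields) > 1 else 1
        entries.append((name, count))
    return entries


workstations = parse_machines("donner\ndalek\npitcairn 2\n")
total_processes = sum(count for _, count in workstations)
```

The same function accepts resource manager names unchanged, since it treats the first field as an opaque identifier.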

The MPICH-G implementation then uses the 
Globus information service, MDS, to perform a simple 
form of resource location, accessing MDS to determine 
detailed contact information (e.g., port numbers) for 
the specified resource managers. Hence, the user need 
not be concerned with low-level details regarding the 
physical location and interfaces of resources. 

The user can build on this simple capability to im- 
plement more sophisticated resource location schemes. 
For example, rather than specifying node counts in the 
machines file, the user can perform an MDS search to 
determine how many nodes are available on each ma- 
chine, and can rewrite the machines file appropriately. 
Or, the user can perform an MDS search to locate re- 
source managers with particular properties (e.g., idle 
nodes and specified network bandwidth) and then place 
the names of those systems in the machines file. 
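
Such a scheme might be sketched as below. The dictionary idle_nodes is a hypothetical stand-in for the result of an MDS search (no MDS or LDAP call is made here), and rewrite_machines is a name introduced only for this illustration.

```python
def rewrite_machines(names, idle_nodes, min_nodes=1):
    """Produce machines-file text listing managers that have enough idle nodes."""
    lines = []
    for name in names:
        n = idle_nodes.get(name, 0)     # unknown managers count as having no nodes
        if n >= min_nodes:
            lines.append(f"{name} {n}")
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical search result: manager name -> number of idle nodes.
text = rewrite_machines(["a-fork", "b-lsf", "c-fork"], {"a-fork": 8, "b-lsf": 0})
```

The resulting text can be written to the machines file before mpirun is invoked, so the startup command itself does not change.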

4.2 Job Submission and Execution 

Once the machines file has been read and resource 
manager contacts determined, the MPICH-G mpirun 
implementation calls a Globus-provided function called 
globusrun to manage the task of job submission and 
execution. This function uses a variety of Globus ser- 
vices and libraries, as follows: 

Co-allocation. As noted above, the creation of a 
computation that spans multiple MPPs is a difficult 
problem. We must allocate resources on the selected 
computers, start processes, and link these processes 
into a computation. Different computers differ widely 
in the mechanisms used for resource allocation and 
process creation, so a first requirement is to negotiate the 
appropriate mechanisms at each site. A second concern 
is that startup can be a time-consuming and error-prone 
activity; hence, we require techniques for detecting 
failure (e.g., via timeout) and synchronizing once startup 
completes. These two concerns are addressed via the 
use of the GRAM interface (discussed above) and an 
appropriate co-allocator library, respectively. MPICH-G 
uses the Dynamically-Updated Request Online 
Co-allocator (DUROC). DUROC submits requests, 
verifies correct startup, and provides functions that can 
then be used to coordinate the various subjobs so as 
to create (in our current case) a single MPI_COMM_WORLD 
spanning all processes. The need to reserve resources 
at multiple sites simultaneously remains a problem, 
which we are investigating in current work. 

Authentication and authorization. A significant 
obstacle to the use of multiple distributed resources is 
that the user will typically have a distinct "trust re- 
lationship" (e.g., account), or even no prior trust re- 
lationship at all, at different sites. Hence, starting a 
program can be a frustrating process involving multi- 
ple logins. MPICH-G avoids this because the Globus 
Security Infrastructure supports single sign-on and au- 
tomatic mapping (under site control) to appropriate 
local accounts. Public key technology is used to avoid 
the transfer of plaintext passwords. 

Executable staging. Manual staging of executables 
is another painful activity. MPICH-G overcomes this 
obstacle by using the Globus "Global Access Secondary 
Storage" (GASS) service to stage executables to remote 
machines. Currently, this technique works only if the 
programmer has supplied an appropriate executable for 
each remote computer. In future work, the Globus 
group plans to investigate automated techniques for 
identifying and generating appropriate executables, for 
example by using compile servers. 

Communication. As described in an earlier pa- 
per [8] that focused specifically on multimethod com- 
munication in MPICH-G, the Nexus communication 
library is used to provide access to multiple com- 
munication methods [9]: e.g., TCP/IP in the wide 
area, vendor-specific protocols within a computer, and 
shared memory within a cluster. 

Monitoring, control, stdout. The globusrun util- 
ity used by mpirun also provides a number of other use- 
ful capabilities. Callbacks provided by GRAMs allow 
it to detect and report termination. Control functions 
provided by the GRAM API allow it to terminate a 
computation in the event of a user signal (control-C) 
or if a component fails. Finally, GASS mechanisms are 
used to collect standard output and error streams and 
route these back to the originating terminal. 

5 Performance Studies 

An empirical evaluation of a library such as MPICH- 
G should, ideally, address at least the following issues: 

1. Startup costs: What is the cost of the authentica- 
tion, authorization, resource location/allocation, 
and other management mechanisms? Are these 
mechanisms scalable? 

2. Communication costs: What is the impact of the 
multimethod communication support on point-to- 
point and collective communication performance, 
for both simple benchmark programs and real ap- 
plications, and in both homogeneous and hetero- 
geneous environments? 

3. Reliability: Are the management and communica- 
tion mechanisms provided able to operate reliably 
in wide area environments? 

We present here preliminary results for point-to- 
point communication performance in homogeneous sys- 
tems; optimization in this configuration, and other 
measurements, are ongoing. We use the "ping-pong" 
benchmark programs provided with MPICH [15] to 
evaluate the performance of MPICH-G. We study per- 
formance on an IBM SP2 system at Lawrence Liver- 
more National Laboratory (LLNL). This system runs 
AIX 4.3.1 and is configured with four-way SMP nodes 
with 332 MHz PowerPC 604e processors. This config- 
uration provides 1.2 GB/s bandwidth to memory and 
150 MB/s switch bandwidth. All communication mea- 
surements are between processors on different nodes. 

We measured performance for five different commu- 
nication libraries: 

1. IBM-MPI, the nonthreaded IBM implementa- 
tion of MPI. 

2. IBM-MPL, the IBM implementation of MPL, 
the original communication library provided on 
the IBM SP. 

3. MPICH-mpl, MPICH operating over the IBM 
MPL library. 

4. Nexus, the Globus communication library (also 
operating over the IBM MPL library in this situ- 
ation). 

5. MPICH-G, MPICH-G operating over the Globus 
communication library (which in turn uses the 
IBM MPL library). 

In addition, for each of these libraries we measured 
performance when operating over two different bind- 
ings for the IBM MPI and IBM MPL libraries: one that uses 
the more efficient user space communication and one 
based on TCP/IP. Also, for Nexus and MPICH-G we 
evaluated the impact of two different values for the 
"skip-poll" parameter, as discussed below. The results 
are presented in Tables 1 and 2. 

In brief, we find that when using user space commu- 
nication, MPICH-G incurs an overhead of 48 μsec for 
a zero-length message (when skip-poll=10K) and 
achieves 35 percent of the peak bandwidth achieved 
by IBM's MPL. These are certainly not good results, 
but nor are they dreadful, and on the basis of previ- 
ous studies [12, 8], we believe that we understand the 
source of these overheads and know how to eliminate 
a significant part of them, by eliminating extra copies, 
improving memory management, and streamlining cer- 
tain interfaces. Overall, we believe that we can achieve 
performance close to that of MPICH-mpl in most sit- 
uations. 

The user space results for Nexus and MPICH-mpl 
provide some insights into the nature of the overheads. 
The zero-byte latency for Nexus is 42 μsec, while that 
for MPICH-mpl is only 32 μsec; this difference reflects 
certain known overheads associated with the Nexus 
communication model and implementation [12]. But 
the bulk of the overhead (31 μsec) is clearly associated 
with the layering of MPICH-G on Nexus, something 
that we have not optimized carefully. The bandwidth 
numbers for Nexus and MPICH-G are identical, indi- 
cating that the overheads here lie in Nexus. The source 
of this overhead is additional copies performed in the 
Nexus system on send and receive. These can be cor- 
rected, but the necessary optimizations have not yet 
been performed. 

When using TCP/IP for communication, MPICH- 
G incurs a similar overhead for zero-length messages 
(69 μsec) but now attains 61 percent of the band- 
width achieved by IBM's MPI. The overheads asso- 
ciated with the layering of MPICH-G over Nexus and 
the bandwidth behaviors seen for Nexus and MPICH-G 
are comparable to those seen in the user space case. 
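
The overhead and bandwidth figures quoted in the preceding paragraphs can be rechecked from the entries of Tables 1 and 2. The short Python sketch below does that arithmetic. Only the numeric values come from the tables; the dictionary layout and the function names are illustrative assumptions, not part of MPICH-G.

```python
# Zero-length message latencies in microseconds (Table 1, "Latency" column).
# Nexus and MPICH-G values are the skip-poll=10K rows.
LATENCY_USEC = {
    ("user", "IBM-MPI"): 25, ("user", "MPICH-mpl"): 32,
    ("user", "Nexus"): 42, ("user", "MPICH-G"): 73,
    ("tcp", "IBM-MPI"): 131, ("tcp", "Nexus"): 160, ("tcp", "MPICH-G"): 200,
}

# Bandwidth in KB/sec for 1M-byte messages (Table 2, last column).
BANDWIDTH_1M = {
    ("user", "IBM-MPI"): 76809, ("user", "IBM-MPL"): 77005,
    ("user", "MPICH-G"): 27268,
    ("tcp", "IBM-MPI"): 27686, ("tcp", "MPICH-G"): 16810,
}


def latency_overhead(binding, library, baseline):
    """Extra zero-length latency of `library` relative to `baseline`."""
    return LATENCY_USEC[(binding, library)] - LATENCY_USEC[(binding, baseline)]


def bandwidth_percent(binding, library, baseline):
    """Bandwidth of `library` as a percentage of `baseline` at 1M bytes."""
    return 100.0 * BANDWIDTH_1M[(binding, library)] / BANDWIDTH_1M[(binding, baseline)]


if __name__ == "__main__":
    print(latency_overhead("user", "MPICH-G", "IBM-MPI"))   # overhead vs. IBM MPI
    print(latency_overhead("user", "MPICH-G", "Nexus"))     # layering cost on Nexus
    print(latency_overhead("tcp", "MPICH-G", "IBM-MPI"))
    print(bandwidth_percent("user", "MPICH-G", "IBM-MPL"))
    print(bandwidth_percent("tcp", "MPICH-G", "IBM-MPI"))
```

Expressing each quantity as a difference or ratio against a named baseline makes it easy to repeat the comparison for other rows of the tables.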

We comment finally on the significance of the skip- 
poll parameter. As discussed elsewhere [9], the perfor- 
mance of multimethod systems that depend on polling 
to detect incoming communications can be sensitive to 
the frequency with which different interfaces are polled. 
In the current case, a user space poll is cheap (less than 
one μsec), while an IP poll can cost tens of microsec- 
onds. Hence, a simple round-robin strategy that polls 
the two interfaces in sequence will often delay the pro- 
cessing of incoming user space communications. We 
allow the user to control the polling strategy used by 
providing a parameter "skip-poll" that specifies how 
many "fast" polls are performed before a slow poll is 
performed. Hence, a very large skip-poll value such 
as 10,000 is a close approximation to the case when 
the slow protocol is not used at all, while skip-poll=0 

Table 1. Preliminary performance results for MPICH-G: One-way message times on the LLNL IBM SP2 

Communication  Skip   Latency   Time (μsec) vs. Msg Size (bytes) 
Library        poll   (μsec)       10     100      1K     10K    100K      1M 

User space communication: 
IBM-MPI                   25       27      32      64     284    1745   12714 
IBM-MPL                   24       26      30      63     235    1673   12681 
MPICH-mpl                 32       33      44      75     233    1630   12888 
Nexus          10K        42       44      48      88     356    3944   35252 
MPICH-G        10K        73       76      80     121     363    3249   35813 
Nexus          0         161      162     167     224     701    6424   59886 
MPICH-G        0         360      362     368     443     958    6458   57016 

TCP/IP-based communication: 
IBM-MPI                  131      134     143     251     976    4850   35272 
IBM-MPL                  129      133     141     251     718    4542   35061 
MPICH-mpl                184      184     290     393     966    5800   35348 
Nexus          10K       160      163     173     293     899    6993   57557 
MPICH-G        10K       200      206     218     340     989    7058   58092 
Nexus          0         287      289     294     430    1109    7856   6282a 
MPICH-G        0         530      544     558     693    1429    8141   62443 



Table 2. Preliminary performance results for MPICH-G: Bandwidths on the LLNL IBM SP2 

Communication  Skip   Latency   Bandwidth (KB/sec) vs. Msg Size (bytes) 
Library        poll   (μsec)       10     100      1K     10K    100K      1M 

User space communication: 
IBM-MPI                   25      349    3034   15142   34381   55935   76809 
IBM-MPL                   24      370    3219   15396   41401   58358   77005 
MPICH-mpl                 32      292    2211   12975   41868   59882   75769 
Nexus          10K        42      221    1995   10975   27366   24757   27701 
MPICH-G        10K        73      128    1217    8067   26896   30051   27268 
Nexus          0         161       60     583    4355   13918   15200   16312 
MPICH-G        0         360       26     265    2201   10184   15121   17127 

TCP/IP-based communication: 
IBM-MPI                  131       72     681    3884   10003   20132   27686 
IBM-MPL                  129       73     688    3882   13594   21498   27853 
MPICH-mpl                184       52     336    2481   10099   16834   27626 
Nexus          10K       160       59     563    3331   10854   13964   16966 
MPICH-G        10K       200       47     446    2864    9869   13835   16810 
Nexus          0         287       33     331    2271    8801   12430   15543 
MPICH-G        0         530       17     174    1407    6833   11994   15639 

corresponds to round-robin polling. We see from Ta- 
bles 1 and 2 that the round-robin strategy performs 
significantly worse than skip-poll=10K. Fortunately, ex- 
perience shows that even quite small skip-poll values 
can provide acceptable overheads while providing rea- 
sonable responsiveness for the different methods. 
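
The effect of the skip-poll parameter on polling order can be illustrated with a small Python sketch. It is a simplified model, not the Nexus implementation: we assume that every iteration performs one fast (user space) poll and that the slow (IP) poll is skipped skip_poll times between successive slow polls, so that skip_poll=0 gives round-robin behavior. The function names and the per-poll costs (1 μsec and 20 μsec) are illustrative assumptions consistent with the rough magnitudes given in the text.

```python
from typing import List


def poll_schedule(skip_poll: int, iterations: int) -> List[str]:
    """Return the sequence of polls issued over a number of loop iterations.

    Assumed model: each iteration issues one fast poll; the slow poll is
    issued only after it has been skipped `skip_poll` times.
    """
    schedule = []
    skipped = 0
    for _ in range(iterations):
        schedule.append("fast")
        if skipped >= skip_poll:
            schedule.append("slow")
            skipped = 0
        else:
            skipped += 1
    return schedule


def mean_cost_per_iteration(skip_poll: int, iterations: int,
                            fast_usec: float = 1.0,
                            slow_usec: float = 20.0) -> float:
    """Average polling cost per iteration under assumed per-poll costs."""
    schedule = poll_schedule(skip_poll, iterations)
    total = sum(fast_usec if kind == "fast" else slow_usec for kind in schedule)
    return total / iterations


if __name__ == "__main__":
    for value in (0, 10, 10000):
        print(value, mean_cost_per_iteration(value, 100000))
```

Under these assumptions the average cost per iteration falls quickly as skip_poll grows, which is in line with the observation that even small skip-poll values can give acceptable overheads.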

6 Future Work 

We are working with colleagues to extend the 
MPICH-G implementation in a number of areas. 

Shared-memory support. To date, we have ex- 
plored the use of just two communication meth- 
ods: user-space communications within an MPP and 
TCP/IP between MPPs. On computers such as the 
IBM SP, we can also exploit more efficient shared mem- 
ory communications within SMP clusters, hence pro- 
viding a total of three different communication meth- 
ods. We are working with colleagues at USC/ISI to 
implement and evaluate this strategy. 

Topology-aware communication operations. In 
heterogeneous grid environments, collective operations 
such as MPI_REDUCE can execute significantly faster 
if their implementation takes advantage of knowledge 
of the underlying system topology. For example, an 
MPI_REDUCE operation in the environment of Figure 1 
might well first reduce within each SMP node, then 
within each MPP, and finally across MPPs. In order to 
implement such optimizations, the MPICH implemen- 
tation requires information about the topology of the 
underlying machine. We are working with colleagues at 
LLNL to identify the required information and will ex- 
tend the Globus device with additional functions that 
provide this information. 
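
The three-level reduction order described above can be sketched in a few lines of Python. This is only an illustration of the combining order, not an MPI implementation: no messages are exchanged, and the nested-list representation of the topology (a list of MPPs, each a list of SMP nodes, each a list of per-process values) and the function name are assumptions made for the example.

```python
from functools import reduce
import operator


def hierarchical_reduce(topology, op=operator.add):
    """Combine values node by node, then MPP by MPP, then across MPPs.

    `topology` is assumed to be a list of MPPs; each MPP is a list of SMP
    nodes; each node is a non-empty list of per-process values.
    """
    mpp_results = []
    for mpp in topology:
        # Step 1: reduce within each SMP node (cheapest communication).
        node_results = [reduce(op, node) for node in mpp]
        # Step 2: reduce across the nodes of one MPP.
        mpp_results.append(reduce(op, node_results))
    # Step 3: reduce across MPPs (most expensive, wide-area communication).
    return reduce(op, mpp_results)


if __name__ == "__main__":
    grid = [
        [[1, 2], [3, 4]],      # MPP 0: two SMP nodes
        [[5], [6, 7, 8]],      # MPP 1: two SMP nodes
    ]
    print(hierarchical_reduce(grid))
    print(hierarchical_reduce(grid, max))
```

With this ordering only one partial result per MPP has to cross the wide area, which is the motivation for making the reduction topology-aware.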

User-level communication structures can also take 
advantage of topology information. In principle, MPI's 
topology operations provide a basis for providing this 
information to applications. We plan to study whether 
these operations are indeed appropriate, or whether 
MPI extensions are needed to allow programmers to 
implement efficient applications in wide area environ- 
ments. 

Looking further into the future, we are interested 
in exploring more sophisticated techniques suitable for 
true wide area operation, for example exploiting Nexus 
support for multicast [21] and using network perfor- 
mance information (e.g., [22]) to adapt a combining 
tree structure in response to changing network loads. 

MPI-2 extensions. The MPI-2 revisions to the MPI 
standard introduce a number of new features, including 
single-sided operations, dynamic process creation and 
attachment, and parallel I/O. All three of these exten- 
sions can, in principle, be incorporated into MPICH- 
G easily: The Nexus communication library used in 
MPICH-G provides a single-sided communication op- 
eration as a primitive; Globus mechanisms support dy- 
namic process creation and attachment; and a remote 
I/O binding for MPI-IO has already been developed. 
However, numerous details remain to be worked out in 
each of these areas, and the MPICH framework itself 
must be extended to support these new features. 

7 Summary 

We have described MPICH-G, an implementation of 
the Message Passing Interface that uses services pro- 
vided by the Globus toolkit to allow the use of MPI 
in wide area environments. MPICH-G masks details 
of underlying networks and computer architectures so 
that diverse distributed resources can appear as a sin- 
gle "MPI_COMM_WORLD." Any arbitrary MPI application 
can be started on heterogeneous collections of machines 
simply by typing mpirun: authentication, authoriza- 
tion, executable staging, resource allocation, job cre- 
ation, startup, and routing of stdout and stderr are all 
handled for free. 

We believe that MPICH-G is interesting not only 
in its own right but also as a demonstration and test 
case for Globus services. MPICH-G was constructed by 
adapting MPICH, a widely used MPI implementation 
for workstations and MPPs. This adaptation involved 
the use of various Globus tools, for security, remote file 
access, synchronized startup, and multimethod com- 
munication. Relatively few changes to MPICH were 
required to support the use of these tools. 

MPICH-G passes the MPICH test suite and is hence 
ready for broad distribution and use. Work is continu- 
ing on point-to-point performance optimization, appli- 
cation development, and research investigations relat- 
ing to collective operation performance, network topol- 
ogy information, MPI-2 implementation, and other is- 
sues. 

Abstract — Reservation and adaptation are two well-known and effective 
techniques for enhancing the end-to-end performance of network applica- 
tions. However, both techniques also have limitations, particularly when 
dealing with high-bandwidth, dynamic flows: fixed-capability reservations 
tend to be wasteful of resources and hinder graceful degradation in the face 
of congestion, while adaptive techniques fail when congestion becomes ex- 
cessive. We propose an approach to quality of service (QoS) that overcomes 
these difficulties by combining features of reservations and adaptation. In 
this approach, a combination of online control interfaces for resource man- 
agement, a sensor permitting online monitoring, and decision procedures 
embedded in resources enable a rich variety of dynamic feedback interac- 
tions between applications and resources. We describe a QoS architecture, 
GARA, that has been extended to support these mechanisms, and use three 
examples of application-level adaptive strategies to show how this frame- 
work can permit applications to adapt both their resource requests and 
behavior in response to online sensor information. 

I. Introduction 

Network applications that need to achieve reliable end-to-end 
performance typically make use of either reservations or adap- 
tation. When using reservations, applications usually specify 
quality of service (QoS) requirements when a connection is es- 
tablished and do not change them subsequently; the QoS system 
in turn guarantees that (modulo system failures or preemptions) 
the reservation will not be reduced during the lifetime of the 
application [1], [2]. In contrast, applications that use adaptation 
do not make reservations but instead adapt to the network condi- 
tions at hand by responding to some form of feedback, whether 
explicit (notification of network conditions) or implicit (noticing 
that bandwidth is low). Adaptation may occur when the appli- 
cation detects a problem or when the application is notified that 
a problem may exist [3], [4], [5], [6], [7]. 

Both reservations and adaptation have been proven effec- 
tive in many situations, but also have significant limitations, 
particularly when dealing with high-end applications featuring 
high-bandwidth, dynamic flows. Fixed-capability reservations 
can waste bandwidth and do not permit graceful degradation 
in application performance when resource management policies 
mandate changes in allocations. Adaptive techniques inevitably 
fail when congestion reduces available resources below accept- 
able limits [8], [9]. 

In this paper, we describe an approach to QoS that combines 
features of both reservations and adaptation to address the diffi- 
culties just noted. At the core of this approach is a QoS archi- 
tecture in which resources are enhanced with: 

• online control interfaces that allow applications, or agents 
acting on their behalf, to modify resource characteristics (e.g., 
reservations) dynamically; 

• sensors that allow applications (and agents) to detect when 
adaptation is required; and 

• decision procedures that support the expression of a rich set 
of resource management policies. 

These mechanisms in turn enable a wider range of 
application-level adaptation strategies than are supported in 
other architectures. For example, online control of reservations 
allows applications to request premium service when adaptive 
techniques fail to deliver; monitoring of reservations that change 
as a result of decision procedures embedded in resource man- 
agers allows for graceful degradation in application performance 
in response to preemption. 

To explore these ideas, we have incorporated such mecha- 
nisms into a QoS architecture developed in previous work — 
the Globus Architecture for Reservation and Allocation 
(GARA) [10], [11]. We have completed a prototype implemen- 
tation of this enhanced architecture, which has been deployed 
by ourselves and others on local and national testbeds. 

We hypothesize that the mechanisms and associated control 
and information flows provided by this extended GARA archi- 
tecture can be exploited to obtain more efficient resource us- 
age than in purely reservation-based or application-based ap- 
proaches, as applications can vary reservations and rates; to pro- 
vide more flexible resource allocation strategies, as resources 
can change allocations over the course of a reservation; and to 
deliver more robust application performance, as applications can 
detect and respond to changes in allocations and resource state. 

As a first step towards testing this hypothesis, we have used 
GARA mechanisms to implement three different adaptive strate- 
gies. The first two use a flow-specific packet loss sensor to adapt 
bandwidth requests to the QoS system in order to meet perfor- 
mance targets, for UDP and TCP flows, respectively; the third 
uses a sensor that provides information on changes in reserva- 
tion level (as a result of preemption) to adapt transmission rate 
for bulk data transfer applications. In each case, we present 
novel decision procedures and demonstrate that we can deliver 
interesting adaptive behaviors via a combination of online mon- 
itoring and control. 

In the rest of this paper, we review the QoS requirements of 
high-end applications, describe our enhanced GARA architec- 
ture, and present our three adaptive strategies and the experi- 
mental studies that we have performed to evaluate their effec- 
tiveness. We conclude with a brief discussion of related and 
future work. 

II. Motivation: High-End Applications 



We are interested in providing QoS mechanisms for high-end 
network applications [11], in which individual flows can have 
high bandwidth, from a few megabits per second (Mb/s) to many 
tens or hundreds of Mb/s; there may be complex mixes of flows, 
from low bandwidth to high bandwidth and from low latency to 
high latency; and flows may change their requirements dynami- 
cally throughout their lifetime. 

Applications with these characteristics arise in such areas 
as distance visualization, analysis of petabyte-scale scientific 
databases, online control of scientific instrumentation, and 
teleimmersion [12]. For illustrative purposes, we examine a 
teleimmersion example in more detail. Consider two or more 
users at geographically separate locations who are exploring col- 
laboratively a three-dimensional visualization of experimental 
data. As in other telecollaboration systems, we have a num- 
ber of streams with fairly constant rate and low to moderate 
bandwidth: audio and video streams for communication, and 
jitter- and latency-sensitive streams for the tracking data indi- 
cating user movements in the virtual space. In addition, we have 
streams with higher bandwidth and often variable rates, used for 
visualization data and (in some cases) database updates. Visual- 
ization data is calculated from the data set, and a representation 
of it, perhaps a set of polygons for rendering, is transmitted [13]. 
The actual amount of data sent depends on both the data being 
visualized and user actions, which may include zooming and 
movement in space and time. Contention for shared resources 
such as disk and CPU can also affect the transmission rate. 

These characteristics place substantial demands on both net- 
work infrastructure and applications. For example, consider a 
situation in which several teleimmersion sessions are in oper- 
ation simultaneously, while other groups are concurrently at- 
tempting to perform high-speed bulk-data transfers over the 
same network infrastructure, perhaps to stage data required for 
an experiment later in the day. With today's protocols and ser- 
vices, no group would obtain acceptable service. 

We believe that concerns such as these require that resource 
providers be able to specify and implement flexible resource al- 
location policies. For example, in the situation just noted, re- 
source providers might allocate resources to different teleim- 
mersion sessions and bulk-data transfers differentially. Teleim- 
mersion session A might have priority, while sessions B and C 
would be guaranteed some minimum service. Bulk-data trans- 
fers D and E would have lowest instantaneous priority but would 
be guaranteed service in terms of another "terabytes per hour" 
metric. 

We also believe that a policy-driven framework of this sort 
can be effective only if applications themselves are provided 
with the information and control flows required to detect and 
adapt to policy-driven changes in resource allocations. For ex- 
ample, a teleimmersion session could respond to reduced (in- 
creased) resource availability by reducing (increasing) video 
rates or introducing (eliminating) data compression to noncriti- 
cal users, while a bulk-data transfer could reduce (increase) its 
sending rate. The architecture that we present in this paper en- 
ables these sorts of adaptation. 

III. Reservation and Adaptation Combined 

Effective adaptive control requires three distinct mechanisms. 
In the language of [14], these are 

• actuators that permit online control, for example, of resource 
allocations or application behavior; 

• sensors that permit monitoring, for example, of resource allo- 
cations or application behavior; and 

• decision procedures that allow entities to respond to sensor 
information, by invoking actuators. 

As illustrated in Figure 1, these three elements act in concert 
to achieve adaptive control. For example, a sensor might signal 
a nonzero loss rate associated with a flow at a router. A deci- 
sion procedure in the associated application can then execute to 
determine whether to reduce the sending rate or, alternatively, 
generate a request to a resource manager to create (or increase) 
a reservation for that flow, hence invoking an actuator. 
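
The interplay of sensor, decision procedure, and actuator in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not GARA code: the class names, the on_loss function, the use of unallocated bandwidth as the deciding criterion, and the step and backoff values are all hypothetical choices made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class LossNotification:
    """Hypothetical sensor event: observed loss and spare premium bandwidth."""
    loss_fraction: float
    unallocated_mbps: float


class ReservationActuator:
    """Hypothetical actuator that stands in for a request to a resource manager."""

    def __init__(self, reserved_mbps: float):
        self.reserved_mbps = reserved_mbps

    def increase(self, extra_mbps: float) -> None:
        self.reserved_mbps += extra_mbps


class Sender:
    """Hypothetical application whose sending rate can be adapted."""

    def __init__(self, rate_mbps: float):
        self.rate_mbps = rate_mbps

    def reduce_rate(self, factor: float) -> None:
        self.rate_mbps *= factor


def on_loss(event: LossNotification, actuator: ReservationActuator,
            sender: Sender, step_mbps: float = 1.0, backoff: float = 0.5) -> str:
    """Decision procedure: grow the reservation if possible, else slow down."""
    if event.loss_fraction <= 0.0:
        return "no action"
    if event.unallocated_mbps >= step_mbps:
        actuator.increase(step_mbps)          # invoke the reservation actuator
        return "increased reservation"
    sender.reduce_rate(backoff)               # adapt application behavior
    return "reduced sending rate"


if __name__ == "__main__":
    actuator, sender = ReservationActuator(10.0), Sender(12.0)
    print(on_loss(LossNotification(0.05, 5.0), actuator, sender))
    print(on_loss(LossNotification(0.05, 0.0), actuator, sender))
```

Keeping the decision procedure as a separate function mirrors the separation of sensing, deciding, and acting described in the text.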

In this section, we first provide an overview of the GARA 
architecture and then explain how we have extended it to support 
these three mechanisms. 



A. GARA Overview 

The Globus Architecture for Reservation and Allocation pro- 
vides advance reservations and end-to-end management for 
quality of service on different types of resources, including net- 
works, CPUs, and disks [10], [11]. 

A GARA system comprises a number of resource managers 
that each implement reservation, control, and monitoring oper- 
ations for a specific resource. Resource managers can be and have 
been implemented for a variety of resource types, hence the use 
of the term "resource manager" rather than the more specific 
"bandwidth broker" favored in the networking literature [15]. 
Uniform interfaces allow applications to express QoS needs for 
different types of resources in similar ways, hence simplify- 
ing the development of end-to-end QoS management strategies. 
Mechanisms provided by the Globus toolkit are used for secure 
authentication and authorization of all requests to resource man- 
agers. An information service allows applications to discover 
resource properties such as current and future availability. 

The work described in this article involves just a single type 
of resource manager, namely, one that uses differentiated ser- 
vices mechanisms [16] to implement network QoS. This re- 
source manager uses the expedited forwarding per-hop behav- 
ior (PHB), as specified by the Internet Engineering Task Force's 
(IETF) Working Group on Differentiated Services, to provide a 
premium service. With careful admission control at the edge of 
the network, it is possible to build a network QoS system with 
reasonably strong bandwidth guarantees, even though traffic is 
treated as an aggregate in the core of the network. 

Fig. 1. An example of how actuators, sensors, and decision procedures may be combined to provide adaptive control. We illustrate reservation adaptation in 
Application A, occurring as a result of a packet loss notification received from a router via the resource manager. The operation of this strategy is described in 
the text, as is the bulk transfer decision procedure that is also shown. 

The resource manager enables reservation requests (see be- 
low) by configuring the routers that it controls. In particular, 
it configures the ingress routers to classify, police, mark, and 
potentially shape all packets that belong to a flow for which 
a reservation has been authorized, as is normally done for dif- 
ferentiated services. The expedited forwarding per-hop behav- 
ior drops packets that exceed the reservation, but allows small 
bursts of excess traffic using a token-bucket mechanism. 
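
The token-bucket policing mentioned above can be illustrated with a short Python sketch. It shows the generic token-bucket technique, not the router's actual implementation: the class name, the choice to count tokens in packets rather than bytes, and the explicit timestamps passed to offer are assumptions made to keep the example self-contained and deterministic.

```python
class TokenBucket:
    """Generic token-bucket policer, counted in packets for simplicity.

    Tokens accumulate at `rate` packets per second up to `depth`. A packet
    conforms if a token is available; otherwise it is dropped.
    """

    def __init__(self, rate: float, depth: float, now: float = 0.0):
        self.rate = rate
        self.depth = depth
        self.tokens = depth      # assume the bucket starts full
        self.last = now

    def offer(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` conforms."""
        elapsed = now - self.last
        self.last = now
        # Refill, but never beyond the bucket depth.
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, depth=3.0)
    # A burst of five packets at t=0: the bucket depth absorbs the first three.
    print([bucket.offer(0.0) for _ in range(5)])
    # Two seconds later, two more tokens have accumulated.
    print([bucket.offer(2.0) for _ in range(3)])
```

The bucket depth bounds the size of a burst that can exceed the reserved rate, which is why a sensor that counts conforming packets has to account for it.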

B. Actuators: Online Control 

A first prerequisite for adaptation is support for online control 
of resource characteristics. (We are also interested in online con- 
trol of application behavior, but that topic is beyond the scope of 
this article.) GARA supports this requirement directly via con- 
trol functions that allow an application, or an agent acting on 
its behalf, to make and subsequently modify QoS reservations. 

In the case of the network resources considered in this arti- 
cle, an application request to the resource manager specifies a 
start time and duration for the desired reservation; the IP ad- 
dresses of the end hosts that will be communicating; the band- 
width required for the reservation; and the network protocol that 
will be used (TCP or UDP) [10]. Since reservations may be 
made in advance, not all information may be known at the time 
the reservation is made. In particular, an application may not 
know what port numbers will be used for communication un- 
til network communications begin. Therefore, GARA provides 
a "bind" operation, which simultaneously "claims" the reserva- 
tion and provides this run-time information. 
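
The separation between making a reservation and binding it can be sketched with a small Python data structure. This is not the GARA API: the class name, field names, and the bind signature are hypothetical, chosen only to mirror the request contents listed above and the deferred supply of port numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetworkReservation:
    """Hypothetical record of a network reservation request."""
    start_time: float            # seconds since some epoch
    duration: float              # seconds
    source_ip: str
    destination_ip: str
    bandwidth_mbps: float
    protocol: str                # "TCP" or "UDP"
    source_port: Optional[int] = None        # unknown until run time
    destination_port: Optional[int] = None   # unknown until run time
    claimed: bool = False

    def __post_init__(self):
        if self.protocol not in ("TCP", "UDP"):
            raise ValueError("protocol must be TCP or UDP")

    def bind(self, source_port: int, destination_port: int) -> None:
        """Claim the reservation and supply the run-time port numbers."""
        if self.claimed:
            raise RuntimeError("reservation already claimed")
        self.source_port = source_port
        self.destination_port = destination_port
        self.claimed = True


if __name__ == "__main__":
    # An advance reservation made before the ports are known...
    r = NetworkReservation(3600.0, 600.0, "10.0.0.1", "10.0.0.2", 40.0, "UDP")
    # ...and bound later, once communication is about to begin.
    r.bind(5000, 6000)
    print(r)
```

Leaving the port fields optional until bind is called reflects the point that an advance reservation can be made before all run-time details exist.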

Both immediate and advance reservations are supported. Ad- 
vance reservations simplify co-scheduling of scarce resources 



and help to ensure that resources are available for important 
events, such as scientific experiments. 

GARA allows third parties to make, monitor, and modify 
reservations on behalf of an application. This capability allows 
us to separate adaptation logic from an application proper; in 
the case of advance reservations, it means that an application 
need not be running when a reservation is made. For brevity, we 
frame subsequent discussion as if only applications manipulate 
reservations; however, in practice, a third party can always be 
substituted. 

C. Sensors 

A second requirement for adaptive control is that we be able 
to determine the state of system components and detect state 
changes. This capability is provided via sensors associated with 
system entities to which other entities can subscribe, with noti- 
fications provided via some form of event service or callback 
mechanism. We have implemented two such sensors in our 
GARA prototype. 

C.1 Loss rate sensor. 

This sensor provides applications with information on packet 
loss rate in the network. This information can serve to indicate 
that the application is either sending too fast or has an inadequate 
reservation. 

We measure packet loss rates at the first hop router: that is, 
the router at which initial policing is performed by our differ- 
entiated services implementation. Our resource manager peri- 
odically queries this router, which because of its classification 
and policing role is able to provide statistics about the number 
of packets that have exceeded a flow's reservation. 

The query to the router returns the number of packets that 
conformed to the reservation and were not dropped ($p_c$) and 
the number of packets that exceeded the reservation and were 
dropped ($p_e$), both counted since the last time 
the statistics were queried. If the resource manager detects a 
nonzero value of $p_e$, then it generates a callback to notify any sub- 
scribed processes that packet loss has occurred. This callback 
specifies both an estimated loss percentage and the currently un- 
allocated bandwidth; an application might use the latter quantity 
as a guide when deciding whether to respond to a packet loss no- 
tification by attempting to increase its reservation or by changing 
its behavior. 

In computing the estimated bandwidth, we must deal with the 
complicating factor that the router uses a token bucket of size 
$p_b$ to allow small bursts. The router updates its statistics only 
periodically (roughly every 10 seconds) and the resource man- 
ager cannot know if the token bucket was full or empty when 
the statistics were gathered. To avoid persistent underestimates 
of loss rates, we assume that the token bucket is at least half-full 
and reduce the number of conforming packets correspondingly. 
This adjustment is reflected in our formula for the estimated fraction 
of packets that were dropped: 

$$p = \frac{p_e}{p_c - p_b/2 + p_e}$$ 

We describe in Section IV how this sensor can be used to 
modify QoS reservations to meet application requirements, for 
both UDP and TCP flows. 
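
The loss estimate described in this subsection can be written as a small Python function. The sketch assumes, following the wording of the text, that half the token-bucket size is subtracted from the count of conforming packets before the dropped fraction is computed. The function name, the clamping of the adjusted count at zero, and the sample numbers are illustrative assumptions.

```python
def estimated_loss_fraction(p_c: float, p_e: float, p_b: float) -> float:
    """Estimate the fraction of dropped packets for one reporting interval.

    p_c: packets that conformed to the reservation (not dropped)
    p_e: packets that exceeded the reservation (dropped)
    p_b: token-bucket size, in packets

    Assumption: the bucket is taken to be half-full, so p_b / 2 conforming
    packets are attributed to the burst allowance and removed from p_c.
    """
    if p_e <= 0:
        return 0.0
    adjusted_conforming = max(0.0, p_c - p_b / 2.0)   # clamp is our own choice
    return p_e / (adjusted_conforming + p_e)


if __name__ == "__main__":
    # Illustrative interval: 950 conforming, 50 dropped, bucket of 100 packets.
    naive = 50 / (950 + 50)
    adjusted = estimated_loss_fraction(950, 50, 100)
    print(naive, adjusted)
```

Because the adjustment shrinks the denominator, the adjusted estimate is never lower than the naive ratio of dropped to total packets, which matches the stated goal of avoiding persistent underestimates.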

C.2 Reservation change sensor. 

Our second sensor is used to publish information about 
changes in resource allocations. The reason for these changes 
is described in the next subsection; here we note simply that we 
have a sensor capable of communicating such changes to inter- 
ested entities. 

D. Decision Procedures 

The third component of an adaptive control architecture com- 
prises the decision procedures that invoke actuators in response 
to sensor data. 

In our environment, such decision procedures can occur in 
multiple locations. They clearly arise in applications, and in- 
deed we give three such examples below. Decision procedures 
can also occur in resource managers; this can lead to interesting 
interactions. 

Decision procedures may be invoked within a GARA re- 
source manager at a number of points. Following authentica- 
tion, an incoming request is first authorized and then executed. 
Decision procedures may be invoked at both stages: for exam- 
ple, to determine whether a request should be granted, in the 
first instance, and to reallocate resources in the second instance 
if the newly authorized reservation oversubscribes available re- 
sources. 

To explore these ideas and demonstrate our ability to incor- 
porate decision procedures in resource managers, we have im- 
plemented the following simple but highly effective procedure. 

D.1 Bulk-data transfer procedure. 

As noted above, bulk-data transfer (BDT) operations have ser- 
vice requirements expressible in terms of "terabytes per hour" 
rather than "Mb/s." Satisfying such requirements in the face of 
congestion can require the use of premium service but need not 
always pre-empt other applications requiring premium service. 

Our BDT decision procedure is designed to exploit this observation.
In effect, it implements two classes of premium service, foreground
and background, within a single premium service class. It does this
by applying the following simple decision rules when processing
requests to create, bind, or terminate reservations.

1. Create foreground reservation: Creation of a foreground 
reservation is authorized if at no time during the reservation pe- 
riod the sum of all foreground reservations would exceed the 
total available premium bandwidth. 

2. Bind foreground reservation: Binding of a foreground reservation
results in the requested bandwidth being allocated to the appropriate
flow. If necessary, premium bandwidth is preempted from background
flow(s), with callbacks being generated to notify interested parties.

3. Cancel reservation: The freed bandwidth is allocated to 
background flows with inadequate allocations, if any such ex- 
ist, and callbacks are generated. 

4. Create background reservation: Creation of a background 
reservation is always allowed. 

5. Bind background reservation: Binding of a reservation re- 
sults in a "fair share" of the unallocated premium bandwidth 
being allocated to the appropriate flow. (See below for a de- 
scription of how this fair share is calculated.) 

We describe below how an application can use the reservation 
change sensor triggered by this decision procedure to achieve 
sustained BDT rates without impeding foreground flows. 
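
The five decision rules above can be sketched in a few lines of Python. This is a minimal illustration, not the GARA implementation: the class and method names are hypothetical, the time dimension of advance reservations is ignored (all reservations are treated as concurrent), and the "fair share" is assumed to be an equal split of the bandwidth not held by bound foreground flows, capped at each background flow's request.

```python
class PremiumClass:
    """Toy model of one premium service class with foreground/background flows."""

    def __init__(self, total, on_change=None):
        self.total = total                  # total premium bandwidth
        self.fg_reserved = {}               # foreground id -> reserved bandwidth
        self.fg_bound = set()               # foreground ids currently bound
        self.bg_created = {}                # background id -> requested bandwidth
        self.bg_bound = set()               # background ids currently bound
        self.bg_alloc = {}                  # background id -> current allocation
        # Callback standing in for the reservation change sensor (assumption).
        self.on_change = on_change or (lambda flow, bandwidth: None)

    def create_foreground(self, rid, bandwidth):
        # Rule 1: refuse if foreground reservations would exceed the total.
        if sum(self.fg_reserved.values()) + bandwidth > self.total:
            return False
        self.fg_reserved[rid] = bandwidth
        return True

    def bind_foreground(self, rid):
        # Rule 2: allocate the full request, preempting background flows.
        self.fg_bound.add(rid)
        self._rebalance()
        return self.fg_reserved[rid]

    def create_background(self, rid, bandwidth):
        # Rule 4: background reservations are always accepted.
        self.bg_created[rid] = bandwidth
        return True

    def bind_background(self, rid):
        # Rule 5: the flow receives a share of the unallocated bandwidth.
        self.bg_bound.add(rid)
        self._rebalance()
        return self.bg_alloc[rid]

    def cancel(self, rid):
        # Rule 3: freed bandwidth goes back to the background flows.
        self.fg_reserved.pop(rid, None)
        self.fg_bound.discard(rid)
        self.bg_created.pop(rid, None)
        self.bg_bound.discard(rid)
        self.bg_alloc.pop(rid, None)
        self._rebalance()

    def _rebalance(self):
        # Assumed fair-share policy: equal split, capped at each request.
        free = self.total - sum(self.fg_reserved[r] for r in self.fg_bound)
        for rid in sorted(self.bg_bound):
            share = min(self.bg_created[rid], free / len(self.bg_bound))
            if self.bg_alloc.get(rid) != share:
                self.bg_alloc[rid] = share
                self.on_change(rid, share)  # notify interested parties


events = []
premium = PremiumClass(total=25, on_change=lambda f, b: events.append((f, b)))
premium.create_background("bdt", 25)
premium.bind_background("bdt")      # background flow gets all 25 units
premium.create_foreground("fg", 10)
premium.bind_foreground("fg")       # background flow is preempted down to 15
premium.cancel("fg")                # background flow is restored to 25
```

In this sketch the `events` list plays the role of the callbacks that a BDT application would receive as foreground reservations come and go.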

IV. Application-Level Adaptation Procedures 

We now describe the three application-level adaptation pro- 
cedures that we have developed to date. 

A. The GARA Testbed

All experiments reported below were performed in the testbed
shown in Figure 2. The testbed consists of three Cisco 7507
routers interconnected with 155 Mb/s (OC-3) ATM. Hosts are
connected to the routers with 100 Mb/s switched Ethernet. All
hosts used in our tests were Sun Ultra 60s. In addition, virtual
circuits to several remote sites permit wide area experiments.

Cisco's Modular QoS command line interface (MQC) is used 
for two different purposes. On the ingress interfaces to the net- 
work, it is used to classify, police, and mark packets. Within 
the interior of the network, it is used to enable Weighted Fair 
Queuing (WFQ) to give priority to marked packets.

B. Adaptive QoS Reservations: UDP Flows 

We first describe how adaptive techniques can be used to determine
the bandwidth reservation required to support a particular UDP flow.
The motivation for this use of adaptation is that many application
developers have no knowledge of QoS mechanisms or of the principles
by which QoS parameters are determined. We show that information
provided by a simple packet loss rate sensor can be used to guide a
decision procedure that sets bandwidth reservations adaptively,
increasing reservations until loss rates reach zero. This decision
procedure can be incorporated in an application or in a separate agent.

Fig. 2. The GARA Network Testbed (GARNET). The core of the testbed
consists of three Cisco 7507 routers. There are several computers on
each end of the testbed, more than shown here. Note the connections to
three wide area networks.

Our decision procedure uses information provided by the packet loss
rate sensor described in Section III-C. Recall that this sensor
periodically generates an estimate of the fraction of packets
dropped, P; hence, 1 - P is the fraction of packets that conformed to
the reservation. Our decision procedure calculates the reservation
that would have been needed for no packets to be dropped, as follows:



$$R_n (1 - P) = R_o$$

or

$$R_n = \frac{R_o}{1 - P}$$

where $R_o$ is the old reservation and $R_n$ is the new reservation.
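
As a worked illustration of this adjustment, the following Python sketch applies the formula to a constant-rate sender. The function name is hypothetical, and the loss fractions in the example are idealized values (sending rate minus reservation, divided by sending rate) that ignore router burst allowances and sampling error.

```python
def new_reservation(old_reservation, loss_fraction):
    """Return R_n = R_o / (1 - P) for an observed loss fraction P."""
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss fraction must be in [0, 1)")
    return old_reservation / (1.0 - loss_fraction)


# A 2500 KB/s reservation with a sender transmitting at 4000 KB/s:
# ideally 1500 of every 4000 KB are dropped, so P = 0.375.
first_case = new_reservation(2500, 1500 / 4000)

# The same reservation with a sender transmitting at 8000 KB/s: P = 0.6875.
second_case = new_reservation(2500, 5500 / 8000)
```

Under these idealized assumptions, a single round of adaptation recovers the sending rate in both cases.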

To evaluate the effectiveness of this strategy, we performed 
experiments as follows. In order to obtain a replicable experi- 
ment, we used as our application a test program that sends UDP 
traffic at a user-specified rate across our testbed. 

Results for two similar experiments are superimposed in Fig- 
ure 3. In each case, the application made an initial reservation 
for 2500 kilobytes per second (KB/s) but then sent data at a 
higher rate: in the first case at 4000 KB/s and in the second 
case at 8000 KB/s. As described before, the first router clas- 
sified, policed, and marked traffic. Because the router allows 
small bursts, the application initially was able to send slightly
faster than the reservation allowed, but then the data rate settled 
down to a constant 2500 KB/s. 



Our loss rate sensor is implemented by the GARA resource manager,
which queries the router every ten seconds and provides feedback to
the application for every query except the first. (The first query is
not reported to the application because we wish to gather
statistically sufficient data.) As the resource manager and
application are not synchronized in any way, we should not be
surprised that the feedback arrives at slightly different times in
the two cases: at 16 seconds and 22 seconds, respectively.

It is clear from Figure 3 that the UDP application was able to adapt
quickly in these experiments. However, the poor temporal resolution
offered by our routers means that adaptation need not always work so
well. For example, if the router statistics were gathered just as a
series of packets was starting to be dropped, an unrepresentative
result may be reported to the application. However, this problem
would be compensated for after another round of adaptation. In
addition, our router updates statistics only every ten seconds, which
limits the frequency at which the resource manager can check them.

C. Adaptive QoS Reservations: TCP Flows 

We should not be surprised that it is possible to determine UDP
transmission rates by monitoring packet loss information, given that
UDP does not perform congestion control. Implementing a comparable
adaptive strategy for TCP is significantly more complex because of
TCP's self-clocking mechanisms. Data that an application attempts to
write into a socket buffer at a specific rate may not be transported
immediately because TCP's sliding window protocol requires that
acknowledgments be received before further data is sent. Also, TCP
slows its sending rate when it believes it has encountered
congestion. (In our case, TCP has not encountered congestion, but an
aggressive QoS policing mechanism.) Nevertheless, TCP is used
extensively in the applications that interest us, and so it is
important to support TCP if we can.

Fig. 3. Performance of our UDP reservation adaptation strategy in two
different cases. In the first case, the application is sending at
4000 KB/s while in the second it is sending at 8000 KB/s. In both
cases, an initial reservation of approximately 2500 KB/s is corrected
after a single round of adaptation.

Fig. 4. An example of how our search-based reservation rate strategy
determines the correct reservation to use for a TCP application. The
application doubled its reservation request twice, then narrowed its
reservation with a binary search. The difference between the gross
and net reservations is due to packet overheads.

Because of these difficulties, our decision procedure for TCP 
does not attempt to derive the transmission rate from the packet 
loss rate ratio. Instead, it uses a search procedure to determine 
the correct rate. When the packet loss rate sensor signals that 
packets have been dropped, we simply double the reservation. 
Once the reservation is large enough, we perform a binary search 
between the current reservation and the previous reservation un- 
til we arrive at a reservation that works and that has not changed 
from the previous reservation by more than five percent.
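
The search heuristic described above can be sketched as follows. This is an interpretation rather than the authors' code: `drops_packets` is a hypothetical callback standing in for one reading of the packet loss rate sensor at a given reservation, and the five percent stopping rule is approximated by ending the binary search once the working reservation is within five percent of the last reservation that still lost packets.

```python
def search_reservation(initial, drops_packets, tolerance=0.05):
    """Find a reservation that avoids drops by doubling, then bisecting.

    drops_packets(rate) -> bool is assumed to report whether the sensor
    saw dropped packets while `rate` was reserved.
    """
    if initial <= 0:
        raise ValueError("initial reservation must be positive")

    low = high = float(initial)
    # Phase 1: double the reservation while packets are still dropped.
    # (A real agent would also cap this at the available bandwidth.)
    while drops_packets(high):
        low = high
        high *= 2.0
    if high == low:
        return high  # the initial reservation was already sufficient

    # Phase 2: binary search between the last failing reservation (low)
    # and the first working one (high).
    while (high - low) > tolerance * high:
        mid = (low + high) / 2.0
        if drops_packets(mid):
            low = mid
        else:
            high = mid
    return high


# Hypothetical flow that needs 5500 KB/s (gross) to avoid policing drops.
needs_5500 = lambda rate: rate < 5500
result = search_reservation(1000, needs_5500)
```

Each call to `drops_packets` corresponds to waiting for one sensor report, so with reports arriving every 10 seconds the number of probes directly determines the adaptation time.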

Figure 4 illustrates the results that we obtain with this heuris- 
tic. We see that the search takes some time to adapt but even- 
tually comes close to a correct value. The time delay is largely 
because statistics on dropped packets are reported only every 10 
seconds on our routers. Clearly, decreasing that interval would 
improve the adaptation time. Nevertheless, even this relatively 
long adaptation time is quite acceptable for many of our long- 
lived target applications. 

There are a couple of possibilities for improving upon this search.
One possibility is to change the initial doubling by estimating the
correct multiplier from the percentage of dropped packets, much as we
did in the UDP case. We have performed extensive experiments with
such techniques but have not yet succeeded in identifying a good
estimate for the multiplier, because of TCP's complex behavior.
Recent work proposes modifying TCP's windowing algorithm to be aware
of reservation rates [17], [18].

It may be possible to adapt more quickly by monitoring closer to the
application. In particular, if the application used an instrumented
TCP library that could measure the rate at which the application was
attempting to send, it could use the adjustment to adapt much more
quickly. However, this strategy requires modifications to the
application: our heuristics have the advantage of being usable by a
third party agent.

D. A Bulk-Data Transfer Application

Our third example of an application-level adaptation procedure uses
our BDT reservation-change sensor to guide rate adaptation for BDT
applications. As described in Section III-D, this sensor signals
changes in background flow reservations due to preemption by (or
termination of) higher-priority foreground flows. Our decision
procedure simply adapts the transmission rate of the TCP-based bulk
data transfer application in order to achieve throughput close to the
bandwidth allocated to the BDT flow. Note that in the absence of this
decision procedure, the achieved throughput would tend to be
extremely low, because preemption lowers the background flow's
reservation and packets that exceed the reservation are dropped,
therefore triggering TCP's backoff algorithms.
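
One simple way an application could act on such reservation-change notifications is to pace its writes at the currently allocated rate. The sketch below is a hypothetical illustration of that idea only: the class and method names are invented, and no real GARA callback API or socket code is shown.

```python
class PacedSender:
    """Paces application writes to match the current BDT allocation."""

    def __init__(self, rate_bytes_per_s):
        self.set_rate(rate_bytes_per_s)

    def set_rate(self, rate_bytes_per_s):
        if rate_bytes_per_s <= 0:
            raise ValueError("rate must be positive")
        self.rate = float(rate_bytes_per_s)

    def on_reservation_change(self, new_rate_bytes_per_s):
        # Invoked when the (hypothetical) reservation change sensor fires.
        self.set_rate(new_rate_bytes_per_s)

    def delay_for(self, nbytes):
        # Seconds to wait after writing nbytes so that the long-run
        # sending rate does not exceed the allocated bandwidth.
        return nbytes / self.rate


sender = PacedSender(25_000_000)                  # 25 MB/s allocated
before = sender.delay_for(1_000_000)              # pause after a 1 MB write
sender.on_reservation_change(12_500_000)          # preempted to 12.5 MB/s
after = sender.delay_for(1_000_000)               # pause doubles
```

A real sender would sleep for `delay_for(n)` seconds (minus the time the write itself took) after each write of `n` bytes, so that TCP is not offered more data than the policed reservation admits.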

For our experiments, this decision procedure was incorporated into a
QoS-aware TCP-based BDT application. Figure 5 shows results obtained
in a wide area testbed between Argonne National Laboratory and
Lawrence Berkeley National Laboratory. At about time 5, the
(background) BDT application began and was assigned all of the
premium bandwidth, 25 MB/s. At approximately times 40 and 100, a
foreground reservation began and the BDT reservation was reduced.
When the foreground reservations ended, the background reservation
was increased. Notice that at time 15, competitive UDP traffic began
but does not interfere with either the foreground or background
reservations.

These results show that we are successful in adapting the BDT flow in
response to information concerning preemption by foreground flows.
Apart from a few artifacts, the BDT flow maintains data transfer at a
rate close to the amount of premium bandwidth allocated to that flow.
The artifacts can be explained as follows. First, we see that each
time the BDT reservation is reduced, the BDT rate drops momentarily
more than expected and then recovers. We attribute this behavior to
the fact that TCP shrinks its window size when packets are dropped
(when the reservation is changed before the application adapts),
either by falling into its slow-start phase or into its congestion
avoidance phase.

Fig. 5. An example of bulk transfer in our wide area testbed
(horizontal axis: time in seconds). See text for details.

In addition, the application is using large socket buffers to obtain
high performance over the wide area testbed, and when it enters
slow-start mode (because packets have been dropped once the
reservation decreases) these socket buffers quickly fill up. As TCP
increases its congestion window size exponentially during the
slow-start phase, data is immediately available to send, and TCP
sends the data as increasingly larger bursts until the socket buffer
is emptied. Because the former congestion window size did not reflect
the actual amount of data transmitted, the length of the slow-start
phase after a drop is too long; therefore, data is initially sent too
rapidly for the updated router configuration, forcing packets to be
dropped and TCP to go into slow-start mode again, until the
congestion window becomes more appropriate. This effect is magnified
by the larger bandwidth-delay product and hence larger socket buffers
(1 MB in this case) in the wide area network.

V. Related Work

There has been a great deal of research on rate adaptation 
for network applications when reservation mechanisms are not 
present. For example, Goel et al. [7] describe a modular frame- 
work that provides feedback for not only network streams but 
also CPU scheduling. The present paper takes its terminol- 
ogy of actuators, sensors, and decision procedures from another 
feedback infrastructure, Autopilot [14], which has been used for 
dynamic performance tuning in various settings, including I/O. 

Our approach follows the concept of detaching the "controller" from
the application, as proposed in [3].

Implementing QoS-aware middleware is addressed in several projects.
The Adaptive Quality of Service Architecture for distributed
multimedia applications (AQUA) [19] introduces abstract interfaces
for QoS measurements and negotiation. However, this work focuses on
ATM connections and how to ensure QoS under competition on the
end-system.

The Quartz architecture [20] provides a CORBA-based QoS framework. It
introduces agent-based adaptation and a resource trader, called a
balancing agent, which tries to compensate for the loss of resources
by increasing the amount requested.

VI. Conclusions and Future Work

We have argued that advanced network applications such as
teleimmersion, bulk data transfer, and distance visualization can
benefit from mechanisms that enable the coordinated use of
reservation and adaptation, via support for dynamic feedback among
entities involved in making resource management decisions. We have
described an implementation of such mechanisms within the GARA
resource management architecture. In this implementation, sensors
associated with resources and resource managers permit
application-level monitoring of resource state and reservation
status, while online control mechanisms enable adaptive control of
reservations. We have used these mechanisms to develop three
different application-level adaptive control mechanisms: two that use
loss rate information to adapt reservations and one that uses
reservation state information to adapt transmission rate.

We find these initial results encouraging but recognize that much
more work remains to be done. For example, we would like to
experiment with more sophisticated resource-side allocation policies
and determine to what extent applications can adapt to these policies
in interesting ways. In more complex multidomain environments,
performance feedback and adaptation become more complex, not least
because relevant sensor information may not be easily accessible.
Finally, experimentation with a wider range of applications is
required.

Acknowledgments 

We gratefully acknowledge assistance given by Linda Win- 
kler and Becca Nitzen with the testbed used in these experiments 
and by Andy Adamson, who wrote the UDP traffic generator.
Numerous discussions with our colleagues Gary Hoo, Bill John- 
ston, Carl Kesselman, and Steven Tuecke have helped shape our 
approach to quality of service. We also thank Cisco Systems for 
an equipment donation that allowed the creation of the GAR- 
NET testbed. This work was supported in part by the Mathemat- 
ical, Information, and Computational Sciences Division subpro- 
gram of the Office of Advanced Scientific Computing Research, 
U.S. Department of Energy, under Contract W-31-109-Eng-38;
by the Defense Advanced Research Projects Agency under con- 
tract N66001-96-C-8523; by the National Science Foundation; 
and by the NASA Information Power Grid program. 



References 

[1] L. Wolf and R. Steinmetz, "Concepts for reservation in advance," Kluwer
Journal on Multimedia Tools and Applications, vol. 4, May 1997.

[2] D. Ferrari, A. Gupta, and G. Ventre, "Distributed advance reservation of
real-time connections," ACM/Springer Verlag Journal on Multimedia
Systems, vol. 5, no. 3, 1997.

[3] B. Li and K. Nahrstedt, "A Control-based Middleware Framework for
Quality of Service Adaptations," IEEE Journal of Selected Areas in
Communications, Special Issue on Service Enabling Platforms, June 1999.

[4] B. Li and K. Nahrstedt, "QualProbes: Middleware QoS Profiling Services
for Configuring Adaptive Applications," in Proceedings of IFIP
International Conference on Distributed Systems Platforms and Open
Distributed Processing (Middleware 2000), 2000.

[5] X. Wang and H. Schulzrinne, "Comparison of Adaptive Internet Multimedia
Applications," Institute of Electronics, Information and Communication
Engineers Transactions, vol. E82-B, pp. 806-818, June 1999.

[6] D. Sisalem and H. Schulzrinne, "The Loss-Delay Adjustment Algorithm:
A TCP-friendly Adaptation Scheme," in Proc. International Workshop
on Network and Operating System Support for Digital Audio and Video
(NOSSDAV), July 1998.

[7] A. Goel, D. Steere, C. Pu, and J. Walpole, "Adaptive Resource
Management Via Modular Feedback Control," Tech. Rep. 99-03, Oregon
Graduate Institute, Computer Science and Engineering, Jan. 1999.

[8] W. Almesberger, J. L. Boudec, and T. Ferrari, "Scalable Resource
Reservation for the Internet," in IEEE Conference on Protocols for
Multimedia Systems - Multimedia Networking, Nov. 1997.

[9] R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek, "A Resource
Allocation Model for QoS Management," in 18th IEEE Real-Time System
Symposium, 1997.

[10] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, and A. Roy,
"A Distributed Resource Management Architecture That Supports Advance
Reservations and Co-Allocation," in International Workshop on Quality of
Service, pp. 27-36, June 1999.

[11] I. Foster, A. Roy, V. Sander, and L. Winkler, "End-to-End Quality of
Service for High-End Applications," tech. rep., Argonne National
Laboratory, 1999. http://www.mcs.anl.gov/qos/qos_papers.htm.

[12] I. Foster and C. Kesselman, eds., The Grid: Blueprint for a Future
Computing Infrastructure. Morgan Kaufmann Publishers, 1999.

[13] I. Foster, J. Insley, G. von Laszewski, C. Kesselman, and M. Thiebaux,
"Distance Visualization: Data Exploration on the Grid," IEEE Computer
Magazine, pp. 36-43, Dec. 1999.

[14] R. L. Ribler, J. S. Vetter, H. Simitci, and D. A. Reed, "Autopilot:
Adaptive Control of Distributed Applications," in Proc. 7th IEEE Symp. on
High Performance Distributed Computing, IEEE Computer Society Press, 1998.

[15] K. Nichols, V. Jacobson, and L. Zhang, "A Two-Bit Differentiated
Services Architecture for the Internet," Internet RFC 2638, July 1999.

[16] S. Blake, D. Black, M. Carlson, M. Davies, Z. Wang, and W. Weiss, "An
Architecture for Differentiated Services," Internet RFC 2475, 1998.

[17] W. Feng, D. Kandlur, D. Saha, and K. Shin, "Understanding and
Improving TCP Performance Over Networks with Minimum Rate Guarantees,"
IEEE/ACM Transactions on Networking, vol. 7, pp. 175-187, Apr. 1999.

[18] I. Yeom and A. N. Reddy, "Realizing Throughput Guarantees in A
Differentiated Services Network," in IEEE Int. Conf. on Multimedia
Computing and Systems, pp. 372-376, June 1999.

[19] K. Lakshman and R. Yavatkar, "Integrated CPU and Network I/O QoS
Management in an End-System," Intel Architecture Labs and University of
Kentucky, in Computer Communications Journal Special Issue on Quality
of Service in Distributed Systems, vol. 21, Apr. 1997.

[20] F. Siqueira and V. Cahill, "Delivering QoS in Open Distributed Systems,"
in Proceedings of the 7th IEEE Workshop on Future Trends in Distributed
Computing Systems (FTDCS'99), Dec. 1999.



The Physiology of the Grid 

An Open Grid Services Architecture for Distributed Systems Integration 



Ian Foster¹ ²   Carl Kesselman³   Jeffrey M. Nick⁴   Steven Tuecke¹

¹ Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439
² Department of Computer Science, University of Chicago, Chicago, IL 60637
³ Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292
⁴ IBM Corporation, Poughkeepsie, NY 12601

foster@mcs.anl.gov carl@isi.edu jnick@us.ibm.com tuecke@mcs.anl.gov



Abstract 

In both e-business and e-science, we often need to integrate services across distributed,
heterogeneous, dynamic "virtual organizations" formed from the disparate resources within a
single enterprise and/or from external resource sharing and service provider relationships. This
integration can be technically challenging because of the need to achieve various qualities of
service when running on top of different native platforms. We present an Open Grid Services
Architecture that addresses these challenges. Building on concepts and technologies from the
Grid and Web services communities, this architecture defines a uniform exposed service
semantics (the Grid service); defines standard mechanisms for creating, naming, and discovering
transient Grid service instances; provides location transparency and multiple protocol bindings
for service instances; and supports integration with underlying native platform facilities. The
Open Grid Services Architecture also defines, in terms of Web Services Description Language
(WSDL) interfaces and associated conventions, mechanisms required for creating and composing
sophisticated distributed systems, including lifetime management, change management, and
notification. Service bindings can support reliable invocation, authentication, authorization, and
delegation, if required. Our presentation complements an earlier foundational article, "The
Anatomy of the Grid," by describing how Grid mechanisms can implement a service-oriented
architecture, explaining how Grid functionality can be incorporated into a Web services
framework, and illustrating how our architecture can be applied within commercial computing as
a basis for distributed system integration within and across organizational domains.

This is a DRAFT document and continues to be revised. The latest version can be
found at http://www.globus.org/research/papers/ogsa.pdf. Please send comments to
foster@mcs.anl.gov, carl@isi.edu, jnick@us.ibm.com, tuecke@mcs.anl.gov.

1 Introduction

Until recently, application developers could often assume a target environment that was (to a
useful extent) homogeneous, reliable, secure, and centrally managed. Increasingly, however,
computing is concerned with collaboration, data sharing, and other new modes of interaction that
involve distributed resources. The result is an increased focus on the interconnection of systems
both within and across enterprises, whether in the form of intelligent networks, switching devices,
caching services, appliance servers, storage systems, or storage area network management
systems. In addition, companies are realizing that they can achieve significant cost savings by
outsourcing nonessential elements of their IT environment to various forms of service providers.

These evolutionary pressures generate new requirements for distributed application development
and deployment. Today, applications and middleware are typically developed for a specific
platform (e.g., Windows NT, a flavor of Unix, a mainframe, J2EE, Microsoft .NET) that provides
a hosting environment for running applications. The capabilities provided by such platforms may
range from integrated resource management functions to database integration, clustering services,
security, workload management, and problem determination, with different implementations,
semantic behaviors, and APIs for these functions on different platforms. But in spite of this
diversity, the continuing decentralization and distribution of software, hardware, and human
resources make it essential that we achieve desired qualities of service (QoS), whether measured
in terms of common security semantics, distributed workflow and resource management,
performance, coordinated fail-over, problem determination services, or other metrics, on
resources assembled dynamically from enterprise systems, service provider systems, and
customer systems. We require new abstractions and concepts that allow applications to access and
share resources and services across distributed, wide area networks.

Such problems have been for some time a central concern of the developers of distributed
systems for large-scale scientific research. Work within this community has led to the
development of Grid technologies [30, 34], which address precisely these problems and which
are seeing widespread and successful adoption for scientific and technical computing.

In an earlier article, we defined Grid technologies and infrastructures as supporting the sharing
and coordinated use of diverse resources in dynamic, distributed "virtual organizations" (VOs)
[34]. We defined essential properties of Grids and introduced key requirements for protocols and
services, distinguishing among connectivity protocols concerned with communication and
authentication, resource protocols concerned with negotiating access to individual resources, and
collective protocols and services concerned with the coordinated use of multiple resources. We
also described the Globus Toolkit™¹ [29], an open source reference implementation of key Grid
protocols that supports a wide variety of major e-science projects.

Here we extend this argument in three respects to define more precisely how a Grid functions and
how Grid technologies can be implemented and applied. First, while [34] was structured in terms
of the protocols required for interoperability among VO components, we focus here on the nature
of the services that respond to protocol messages. We view a Grid as an extensible set of Grid
services that may be aggregated in various ways to meet the needs of VOs, which themselves can
be defined in part by the services that they operate and share. We then define the behaviors that
such Grid services should possess in order to support distributed systems integration. By stressing
functionality (i.e., "physiology"), this view of Grids complements the previous protocol-oriented
("anatomical") description.



¹ Globus Project and Globus Toolkit are trademarks of the University of Chicago.

Second, we explain how Grid technologies can be aligned with Web services technologies [40,
47] to capitalize on desirable Web services properties, such as service description and discovery;
automatic generation of client and server code from service descriptions; binding of service
descriptions to interoperable network protocols; compatibility with emerging higher-level open
standards, services and tools; and broad commercial support. We call this alignment (and
augmentation) of Grid and Web services technologies an Open Grid Services Architecture
(OGSA), with the term architecture denoting here a well-defined set of basic interfaces from
which can be constructed interesting systems, and open being used to communicate extensibility,
vendor neutrality, and commitment to a community standardization process. This architecture
uses the Web Services Description Language (WSDL) to achieve self-describing, discoverable
services and interoperable protocols, with extensions to support multiple coordinated interfaces
and change management. OGSA leverages experience gained with the Globus Toolkit to define
conventions and WSDL interfaces for a Grid service, a (potentially transient) stateful service
instance supporting reliable and secure invocation (when required), lifetime management,
notification, policy management, credential management, and virtualization. OGSA also defines
interfaces for the discovery of Grid service instances and for the creation of transient Grid service
instances. The result is a standards-based distributed service system (we avoid the term
distributed object system due to its overloaded meaning) that supports the creation of the
sophisticated distributed services required in modern enterprise and interorganizational
computing environments.

Third, we focus our discussion on commercial applications rather than the scientific and technical
applications emphasized in [30, 34]. We believe that the same principles and mechanisms apply
in both environments. However, in commercial settings we need, in particular, seamless
integration with existing resources and applications, and with tools for workload, resource,
security, network QoS, and availability management. OGSA's support for the discovery of
service properties facilitates the mapping or adaptation of higher-level Grid service functions to
such native platform facilities. OGSA's service orientation also allows us to virtualize resources
at multiple levels, so that the same abstractions and mechanisms can be used both within
distributed Grids supporting collaboration across organizational domains and within hosting
environments spanning multiple tiers within a single IT domain. A common infrastructure means
that differences (e.g., relating to visibility and accessibility) derive from policy controls
associated with resource ownership, privacy, and security, rather than interaction mechanisms.
Hence, as today's enterprise systems are transformed from separate computing resource islands to
integrated, multitiered distributed systems, service components can be integrated dynamically and
flexibly, both within and across various organizational boundaries.

The rest of this article is organized as follows. In Section 2, we examine the issues that motivate the use of
Grid technologies in commercial settings. In Section 3, we review the Globus Toolkit and Web 
services, and in Section 4, we motivate and introduce our Open Grid Services Architecture. In 
Sections 5-8, we present an example and discuss protocol implementations and higher-level 
services. We discuss related work in Section 9 and summarize our discussion in Section 10. 

We emphasize that the Open Grid Services Architecture and associated Grid service
specifications continue to evolve as a result of both standards work within the Global Grid Forum
and implementation work within the Globus Project and elsewhere. Thus the technical content in
this article, and in an earlier abbreviated presentation [32], represents only a snapshot of a work in
progress.

2 The Need for Grid Technologies

Grid technologies support the sharing and coordinated use of diverse resources in dynamic 
VOs — that is, the creation, from geographically and organizationally distributed components, of 
virtual computing systems that are sufficiently integrated to deliver desired QoS [34]. 

Grid concepts and technologies were first developed to enable resource sharing within far-flung
scientific collaborations [18, 19, 28, 30, 46, 64]. Applications include collaborative visualization
of large scientific datasets (pooling of expertise), distributed computing for computationally
demanding data analyses (pooling of compute power and storage), and coupling of scientific
instruments with remote computers and archives (increasing functionality as well as availability)
[45]. We expect similar applications to become important in commercial settings, initially for
scientific and technical computing applications (where we can already point to success stories)
and then for commercial distributed computing applications, including enterprise application
integration and business to business (B2B) partner collaboration over the Internet. Just as the
World Wide Web began as a technology for scientific collaboration and was adopted for
e-business, we expect a similar trajectory for Grid technologies.

Nevertheless, we argue that Grid concepts are critically important for commercial computing not 
primarily as a means of enhancing capability, but rather as a solution to new challenges relating 
to the construction of reliable, scalable, and secure distributed systems. These challenges derive 
from the current rush, driven by technology trends and commercial pressures, to decompose and 
distribute through the network previously monolithic host-centric services, as we now discuss. 

2.1 The Evolution of Enterprise Computing 

In the past, computing typically was performed within highly integrated host-centric enterprise 
computing centers. While sophisticated distributed systems (e.g., command and control systems, 
reservation systems, the Internet Domain Name System [52]) existed, these have remained 
specialized, niche entities [9, 54]. 

The rise of the Internet and the emergence of e-business have, however, led to a growing 
awareness that an enterprise's IT infrastructure also encompasses external networks, resources, 
and services. Initially, this new source of complexity was treated as a network-centric 
phenomenon and attempts were made to construct "intelligent networks" that intersect with 
traditional enterprise IT data centers only at "edge servers": for example, an enterprise's Web 
point of presence, or the virtual private network server that connects an enterprise network to 
service provider resources. The assumption was that the impact of e-business and the Internet on 
an enterprise's core IT infrastructure could thus be managed and circumscribed. 

This attempt has, in general, failed because IT services decomposition is also occurring inside 
enterprise IT facilities. New applications are being developed to programming models (such as 
the Enterprise Java Beans component model [65]) that insulate the application from the 
underlying computing platform and support portable deployment across multiple platforms. This 
portability in turn allows platforms to be selected on the basis of price/performance and QoS 
requirements, rather than on the operating system supported. Thus, for example, Web serving and 
caching applications target commodity servers rather than traditional mainframe computing 
platforms. The resulting proliferation of Unix and NT servers necessitates distributed connections 
to legacy mainframe application and data assets. Increased load on those assets has caused 
companies to off-load nonessential functions (such as query processing) from back-end 
transaction processing systems to mid-tier servers. Meanwhile, Web access to enterprise 
resources requires ever-faster request servicing, further driving the need to distribute and cache 
content closer to the edge of the network. The overall result is a decomposition of highly 
integrated internal IT infrastructure into a collection of heterogeneous and fragmented systems. 
Enterprises must then reintegrate (with QoS) these distributed servers and data resources, 
addressing issues of navigation, distributed security, and content distribution inside the enterprise, 
much as on external networks. 

In parallel with these developments, enterprises are engaging ever more aggressively in e-business 
and are realizing that a highly robust IT infrastructure is required to handle the 
associated unpredictability and rapid growth. Enterprises are also now expanding the scope and 
scale of their enterprise resource planning projects as they try to provide better integration with 
customer relationship management, integrated supply chain, and existing core systems. These 
developments are adding to the significant pressures on the enterprise IT infrastructure. 

The aggregate effect is that qualities of service traditionally associated with mainframe host- 
centric computing [56] are now essential to the effective conduct of e-business across distributed 
compute resources, inside as well as outside the enterprise. For example, enterprises must 
provide consistent response times to customers, despite workloads with significant deviations 
between average and peak utilization. Thus, they require flexible resource allocation in 
accordance with workload demands and priorities. Enterprises must also provide a secure and 
reliable environment for distributed transactions flowing across a collection of dissimilar servers, 
must deliver continuous availability as seen by end-users, and must support disaster recovery for 
business workflow across a distributed network of application and data servers. Yet the current 
paradigm for delivering QoS to applications via the vertical integration of platform-specific 
components and services just does not work in today's distributed environment: the 
decomposition of monolithic IT infrastructures is not consistent with the delivery of QoS through 
vertical integration of services on a given platform. Nor are distributed resource management 
capabilities effective, being limited by their proprietary nature, inaccessibility to platform 
resources, and inconsistencies between similar resources across a distributed environment. 

The result of these trends is that IT systems integrators take on the burden of re-integrating 
distributed compute resources with respect to overall QoS. However, without appropriate 
infrastructure tools, the management of distributed computing workflow becomes increasingly 
labor-intensive, complex, and fragile as platform-specific operations staff watch for "fires" in 
overall availability and performance and verbally collaborate on corrective actions across 
different platforms. This situation is not scalable, cost-effective, or tenable in the face of changes 
to the computing environment and application portfolio. 

2.2 Service Providers and Business-to-Business Computing 

Another key trend is the emergence of service providers (SPs) of various types, such as web- 
hosting SPs, content distribution SPs, applications SPs, and storage SPs. By exploiting economies 
of scale, SPs aim to take standard e-business processes, such as creation of a web-portal presence, 
and provide them to multiple customers with superior price/performance. Even traditional 
enterprises with their own IT infrastructures are offloading such processes because they are 
viewed as commodity functions. 

Such emerging "eUtilities" (a term used to refer to service providers offering continuous, on- 
demand access) are beginning to offer a model for carrier-grade IT resource delivery through 
metered usage and subscription services. Unlike the computing services companies of the past, 
which tended to provide offline batch-oriented processes, resources provided by eUtilities are 
often tightly integrated with enterprise computing infrastructures and used for business 
processes that span both in-house and outsourced resources. Thus, a price of exploiting the 
economies of scale that are enabled by eUtility structures is a further decomposition and 
distribution of enterprise computing functions. EUtilities providers face their own technical 
challenges. To achieve economies of scale, eUtility providers require server infrastructures that 
can be easily customized on demand to meet specific customer needs. Thus, there is a demand for 
IT infrastructure that (1) supports dynamic resource allocation in accordance with service-level 
agreement policies, efficient sharing and reuse of IT infrastructure at high utilization levels, and 
distributed security from edge of network to application and data servers and (2) delivers 
consistent response times and high levels of availability, which in turn drives a need for 
end-to-end performance monitoring and real-time reconfiguration. 

Still another key IT industry trend is cross-enterprise business-to-business (B2B) collaboration 
such as multi-organization supply chain management, virtual web malls, and electronic market 
auctions. B2B relationships are, in effect, virtual organizations, as defined above — albeit with 
particularly stringent requirements for security, auditability, availability, service level 
agreements, and complex transaction processing flows. Thus, B2B computing represents another 
source of demand for distributed systems integration, characterized by often large differences 
among the information technologies deployed within different organizations. 

3 Background 

We review two technologies on which we build to define the Open Grid Services Architecture: 
the Globus Toolkit, which has been widely adopted as a Grid technology solution for scientific 
and technical computing, and Web services, which have emerged as a popular standards-based 
framework for accessing network applications. 

3.1 The Globus Toolkit 

The Globus Toolkit [29, 34] is a community-based, open-architecture, open-source set of services 
and software libraries that support Grids and Grid applications. The toolkit addresses issues of 
security, information discovery, resource management, data management, communication, fault 
detection, and portability. Globus Toolkit mechanisms are in use at hundreds of sites and by 
dozens of major Grid projects worldwide. 

The toolkit components that are most relevant to OGSA are the Grid Resource Allocation and 
Management (GRAM) protocol and its "gatekeeper" service, which provides for secure, reliable, 
service creation and management [25]; the Meta Directory Service (MDS-2) [24], which provides 
for information discovery through soft state registration [59, 69], data modeling, and a local 
registry ("GRAM reporter" [25]); and the Grid Security Infrastructure (GSI), which supports 
single sign on, delegation, and credential mapping. As illustrated in Figure 1, these components 
provide the essential elements of a service-oriented architecture, but with less generality than is 
achieved in OGSA. 







Figure 1: Selected Globus Toolkit mechanisms, showing initial creation of a proxy credential and 
subsequent authenticated requests to a remote gatekeeper service, resulting in the creation of user 
process #2, with associated (potentially restricted) proxy credential, followed by a request to another 
remote service. Also shown is soft-state service registration via MDS-2. 

The GRAM protocol provides for the reliable, secure remote creation and management of 
arbitrary computations: what we term in this article transient service instances. GSI mechanisms 
are used for authentication, authorization, and credential delegation [38] to remote computations. 
A two-phase commit protocol is used for reliable invocation, based on techniques used in the 
Condor system [50]. Service creation is handled by a small trusted "gatekeeper" process (termed 
a factory in this article), while a GRAM reporter monitors and publishes information about the 
identity and state of local computations (registry). 

MDS-2 [24] provides a uniform framework for discovering and accessing system configuration 
and status information such as compute server configuration, network status, or the locations of 
replicated datasets (what we term a discovery interface in this article). MDS-2 uses a soft-state 
protocol, the Grid Notification Protocol [44], for lifetime management of published information. 
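
The soft-state idea can be illustrated with a minimal Python sketch. It is not the MDS-2 implementation or the Grid Notification Protocol; the class `SoftStateRegistry`, its method names, and the injected fake clock are assumptions made purely for illustration. The point shown is that published information carries a lifetime and disappears unless its publisher refreshes it.

```python
import time
from typing import Callable, Dict, List


class SoftStateRegistry:
    """Toy registry: an entry survives only while its publisher keeps refreshing it.

    Hypothetical illustration; not the MDS-2 API.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def publish(self, name: str, lifetime: float) -> None:
        # Publishing again is the refresh: it pushes the expiry time forward.
        self._expiry[name] = self._clock() + lifetime

    def live_entries(self) -> List[str]:
        # Entries whose lifetime has elapsed are dropped without an explicit delete.
        now = self._clock()
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)


# Deterministic demonstration using a fake clock (seconds).
fake_now = [0.0]
registry = SoftStateRegistry(clock=lambda: fake_now[0])
registry.publish("compute-server-a", lifetime=30.0)
registry.publish("replica-catalog", lifetime=10.0)
fake_now[0] = 8.0
registry.publish("compute-server-a", lifetime=30.0)  # refreshed before expiry
fake_now[0] = 20.0                                    # replica-catalog was not refreshed
live = registry.live_entries()
```

Because stale entries age out on their own, a publisher that crashes needs no cleanup step; the cost is that healthy publishers must refresh periodically.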

The public-key-based Grid Security Infrastructure (GSI) protocol [33] provides single sign-on 
authentication, communication protection, and some initial support for restricted delegation. In 
brief, single sign-on allows a user to authenticate once and thus create a proxy credential that a 
program can use to authenticate with any remote service on the user's behalf. Delegation allows 
for the creation and communication to a remote service of delegated proxy credentials that the 
remote service can use to act on the user's behalf, perhaps with various restrictions; this 
capability is important for nested operations. (Similar mechanisms can be implemented within the 
context of other security technologies, such as Kerberos [63], although with potentially different 
characteristics.) 

GSI uses X.509 certificates, a widely employed standard for PKI certificates, as the basis for user 
authentication. GSI defines an X.509 proxy certificate [67] to leverage X.509 for support of 
single sign-on and delegation. (This proxy certificate is similar in concept to a Kerberos 
forwardable ticket but is based purely on public key cryptographic techniques.) GSI typically uses 
the Transport Layer Security (TLS) protocol (the follow-on to SSL) for authentication, although 
other public key-based authentication protocols could be used with X.509 proxy certificates. A 
remote delegation protocol of X.509 proxy certificates is layered on top of TLS. An Internet 
Engineering Task Force draft defines the X.509 Proxy Certificate extensions [67]. Global Grid 
Forum drafts define the delegation protocol for remote creation of an X.509 Proxy Certificate 
[67] and GSS-API extensions that allow this API to be used effectively for Grid programming. 

Rich support for restricted delegation has been demonstrated in prototypes and is a critical part of 
the proposed X.509 Proxy Certificate Profile [67]. Restricted delegation allows one entity to 
delegate just a subset of its total privileges to another entity. Such restriction is important to 
reduce the adverse effects of either intentional or accidental misuse of the delegated credential. 
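
The subset rule behind restricted delegation can be sketched as follows. This is a toy model under stated assumptions: it involves no cryptography and no X.509 processing, and the `Credential` class, its fields, and the right names are hypothetical. It only shows that each delegation step may narrow, but never widen, the set of privileges.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Credential:
    """Toy stand-in for a (proxy) credential: a subject plus a set of rights.

    Hypothetical model only; real GSI proxies are signed X.509 certificates.
    """

    subject: str
    rights: FrozenSet[str]

    def delegate(self, to: str, rights: Iterable[str]) -> "Credential":
        requested = frozenset(rights)
        # A delegator can pass on only rights that it holds itself.
        if not requested <= self.rights:
            extra = sorted(requested - self.rights)
            raise PermissionError(f"cannot delegate rights not held: {extra}")
        return Credential(subject=f"{self.subject}/{to}", rights=requested)


user = Credential("alice", frozenset({"read", "write", "submit-job"}))
scheduler = user.delegate("scheduler", {"read", "submit-job"})
transfer = scheduler.delegate("transfer", {"read"})  # nested operation
```

If the `transfer` credential is misused, the damage is bounded by the single right it carries, which is the motivation the text gives for restriction.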

3.2 Web Services 

The term Web services describes an important emerging distributed computing paradigm that 
differs from other approaches such as DCE, CORBA, and Java RMI in its focus on simple, 
Internet-based standards (e.g., Extensible Markup Language: XML [14, 27]) to address 
heterogeneous distributed computing. Web services define a technique for describing software 
components to be accessed, methods for accessing these components, and discovery methods that 
enable the identification of relevant service providers. Web services are programming language-, 
programming model-, and system software-neutral. 

Web services standards are being defined within the W3C and other standards bodies and form 
the basis for major new industry initiatives such as Microsoft (.NET), IBM (Dynamic e-Business), 
and Sun (Sun ONE). We are particularly concerned with three of these standards: 
SOAP, WSDL, and WS-Inspection. 

• The Simple Object Access Protocol (SOAP) [4] provides a means of messaging between a 
service provider and a service requestor. SOAP is a simple enveloping mechanism for XML 
payloads that defines a remote procedure call (RPC) convention and a messaging convention. 
SOAP is independent of the underlying transport protocol; SOAP payloads can be carried on 
HTTP, FTP, Java Message Service (JMS), and the like. We emphasize that Web services 
can describe multiple access mechanisms to the underlying software component. SOAP is 
just one means of formatting a Web service invocation. 

• The Web Services Description Language (WSDL) [22] is an XML document for describing 
Web services as a set of endpoints operating on messages containing either document-oriented 
(messaging) or RPC payloads. Service interfaces are defined abstractly in terms of 
message structures and sequences of simple message exchanges (or operations, in WSDL 
terminology) and then bound to a concrete network protocol and data-encoding format to 
define an endpoint. Related concrete endpoints are bundled to define abstract endpoints 
(services). WSDL is extensible to allow description of endpoints and the concrete 
representation of their messages for a variety of different message formats and network 
protocols. Several standardized binding conventions are defined describing how to use 
WSDL in conjunction with SOAP 1.1, HTTP GET/POST, and MIME. 

• WS-Inspection [15] comprises a simple XML language and related conventions for locating 
service descriptions published by a service provider. A WS-Inspection language (WSIL) 
document can contain a collection of service descriptions and links to other sources of service 
descriptions. A service description is usually a URL to a WSDL document; occasionally, a 
service description can be a reference to an entry within a Universal Description, Discovery, 
and Integration (UDDI) [5] registry. A link is usually a URL to another WS-Inspection 
document; occasionally, a link is a reference to a UDDI entry. With WS-Inspection, a service 
provider creates a WSIL document and makes the document network accessible. Service 
requestors use standard Web-based access mechanisms (e.g., HTTP GET) to retrieve this 
document and discover what services the service provider advertises. WSIL documents can 
also be organized in different forms of index. 
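
To make the "enveloping mechanism" concrete, the sketch below builds a SOAP 1.1 envelope around an RPC-style payload using only the Python standard library. The envelope namespace is the SOAP 1.1 one; the application namespace `urn:example:trace`, the operation name `GetTraceRecords`, and its parameter are invented for illustration and do not come from any real service.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:trace"                            # hypothetical application namespace


def build_request(operation: str, params: Dict[str, object]) -> bytes:
    """Wrap an RPC-style call in a SOAP Envelope/Body and return it as bytes."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# The resulting bytes could be carried over HTTP or any other transport.
payload = build_request("GetTraceRecords", {"maxRecords": 10})
```

Nothing in `build_request` refers to a transport, which mirrors the statement above that SOAP is independent of the underlying protocol.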

Various other Web services standards have been or are being defined. For example, Web Services 
Flow Language (WSFL) [6] addresses Web services orchestration, that is, the building of 
sophisticated Web services by composing simpler Web services. 

The Web services framework has two advantages for our purposes. First, our need to support the 
dynamic discovery and composition of services in heterogeneous environments necessitates 
mechanisms for registering and discovering interface definitions and endpoint implementation 
descriptions and for dynamically generating proxies based on (potentially multiple) bindings for 
specific interfaces. WSDL supports this requirement by providing a standard mechanism for 
defining interface definitions separately from their embodiment within a particular binding 
(transport protocol and data encoding format). Second, the widespread adoption of Web services 
mechanisms means that a framework based on Web services can exploit numerous tools and 
extant services, such as WSDL processors that can generate language bindings for a variety of 
languages (e.g., Web Services Invocation Framework: WSIF [53]), workflow systems that sit on 
top of WSDL, and hosting environments for Web services (e.g., Microsoft .NET and Apache 
Axis). We emphasize that the use of Web services does not imply the use of SOAP for all 
communications. If needed, alternative transports can be used, for example to achieve higher 
performance or to run over specialized network protocols. 

4 An Open Grid Services Architecture 

We have argued that within internal enterprise IT infrastructures, SP-enhanced IT infrastructures, 
and multi-organizational Grids, computing is increasingly concerned with the creation, 
management, and application of dynamic ensembles of resources and services (and people): 
what we call virtual organizations [34]. Depending on context, these ensembles can be small or 
large, short-lived or long-lived, single institutional or multi-institutional, and homogeneous or 
heterogeneous. Individual ensembles may be structured hierarchically from smaller systems and 
may overlap in membership. 

We assert that regardless of these differences, developers of applications for VOs face common 
requirements as they seek to deliver QoS (whether measured in terms of common security 
semantics, distributed workflow and resource management, coordinated fail-over, problem 
determination services, or other metrics) across a collection of resources with heterogeneous and 
often dynamic characteristics. 

We now turn to the nature of these requirements and the mechanisms required to address them in 
practical settings. Extending our analysis in [34], we introduce an Open Grid Services 
Architecture that supports the creation, maintenance, and application of ensembles of services 
maintained by VOs. 

We start our discussion with some general remarks concerning the utility of a service-oriented 
Grid architecture, the importance of being able to virtualize Grid services, and essential service 
characteristics. Then, we introduce the specific aspects that we standardize in our definition of 
what we call a Grid service. We present more technical details in Section 6 (and in [66]). 

4.1 Service Orientation and Virtualization 

When describing VOs, we can focus on the physical resources being shared (as in [34]) or on the 
services supported by these resources. (A service is a network-enabled entity that provides some 
capability. The term object could arguably also be used, but we avoid that term due to its 
overloaded meaning.) In OGSA, we focus on services: computational resources, storage 
resources, networks, programs, databases, and the like are all represented as services. 

Regardless of our perspective, a critical requirement in a distributed, multiorganizational Grid 
environment is for mechanisms that enable interoperability [34]. In a service-oriented view, we 
can partition the interoperability problem into two subproblems, namely the definition of service 
interfaces and the identification of the protocol(s) that can be used to invoke a particular 
interface and, ideally, agreement on a standard set of such protocols. 

A service-oriented view allows us to address the need for standard interface definition 
mechanisms, local/remote transparency, adaptation to local OS services, and uniform service 
semantics. A service-oriented view also simplifies virtualization — that is, the encapsulation 
behind a common interface of diverse implementations. Virtualization allows for consistent 
resource access across multiple heterogeneous platforms with local or remote location 
transparency, and enables mapping of multiple logical resource instances onto the same physical 
resource and management of resources within a VO based on composition from lower-level 
resources. Virtualization allows the composition of services to form more sophisticated 
services, without regard for how the services being composed are implemented. Virtualization of 
Grid services also underpins the ability to map common service semantic behavior seamlessly 
onto native platform facilities. 

Virtualization is easier if service functions can be expressed in a standard form, so that any 
implementation of a service is invoked in the same manner. WSDL, which we adopt for this 
purpose, supports a service interface definition that is distinct from the protocol bindings used for 
service invocation. WSDL allows for multiple bindings for a single interface, including 
distributed communication protocol(s) (e.g., HTTP) as well as locally optimized binding(s) (e.g., 
local IPC) for interactions between request and service processes on the same host. Other binding 
properties may include reliability (and other forms of QoS) as well as authentication and 
delegation of credentials. The choice of binding should always be transparent to the requestor 
with respect to service invocation semantics, but not with respect to other things: for example, a 
requestor should be able to choose a particular binding for performance reasons. 
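
The separation of interface from binding can be sketched in a few lines of Python. This is an illustration under assumptions, not WSDL tooling: the `Greeter` service, the `Binding` classes, and the use of JSON as a stand-in wire format are all hypothetical. What the sketch shows is one interface reached through two bindings with identical invocation semantics.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class Greeter:
    """Service implementation; it is unaware of how requests reach it."""

    def greet(self, name: str) -> str:
        return f"hello, {name}"


class Binding(ABC):
    """One way of carrying an invocation to a service (hypothetical abstraction)."""

    def __init__(self, service: Any) -> None:
        self._service = service

    @abstractmethod
    def invoke(self, operation: str, **arguments: Any) -> Any:
        ...


class InProcessBinding(Binding):
    """Locally optimized case: a direct method call with no serialization."""

    def invoke(self, operation: str, **arguments: Any) -> Any:
        return getattr(self._service, operation)(**arguments)


class SerializedBinding(Binding):
    """Stand-in for a network binding: request and reply pass through bytes."""

    def invoke(self, operation: str, **arguments: Any) -> Any:
        wire_request = json.dumps({"op": operation, "args": arguments}).encode("utf-8")
        request = json.loads(wire_request.decode("utf-8"))  # "server side" decoding
        result = getattr(self._service, request["op"])(**request["args"])
        wire_reply = json.dumps(result).encode("utf-8")
        return json.loads(wire_reply.decode("utf-8"))


service = Greeter()
local_reply = InProcessBinding(service).invoke("greet", name="grid")
remote_reply = SerializedBinding(service).invoke("greet", name="grid")
```

The requestor sees the same result either way, yet remains free to pick `InProcessBinding` when it knows the service is on the same host.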

The service interface definition and access binding are also distinct from the implementation of 
the functionality of the service. A service can support multiple implementations on different 
platforms, facilitating seamless overlay not only to native platform facilities but also, via the 
nesting of service implementations, to virtual ensembles of resources. Depending on the platform 
and context, we might use the following implementation approaches. 

1. We can use a reference implementation constructed for full portability across multiple 
platforms to support the execution environment (container) for hosting a service. 

2. On a platform possessing specialized native facilities for delivering service functionality, 
we might map from the service interface definition to the native platform facilities. 

3. We can also apply these mechanisms recursively so that a higher-level service is 
constructed by the composition of multiple lower-level services, which themselves may 
either map to native facilities or decompose further. The service implementation then 
dispatches operations to lower-level services (see also Section 4.4). 

As an example, consider a distributed trace facility that records trace records to a repository. On a 
platform that does not support a robust trace facility, a reference implementation can be created 
and hosted in a service execution environment for storing and retrieving trace records on demand. 
On a platform already possessing a robust trace facility, however, we can integrate the distributed 
trace service capability with the native platform trace mechanism, thus leveraging existing 
operational trace management tools, auxiliary offload, dump/restore, and the like, while 
semantically preserving the logical trace stream through the distributed trace service. Finally, in 
the case of a higher-level service, trace records obtained from lower-level services would be 
combined and presented as the integrated trace facility for the service. 
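
The three implementation approaches in the trace example can be sketched as three classes behind one interface. All names here (`TraceService`, `ReferenceTrace`, `NativeTraceAdapter`, `CompositeTrace`) are hypothetical, and a plain Python list stands in for a platform's native trace log; the sketch only illustrates the structure, not any real trace facility.

```python
from typing import List, Protocol, Sequence


class TraceService(Protocol):
    """The common interface: every implementation can return its trace records."""

    def records(self) -> List[str]: ...


class ReferenceTrace:
    """Approach 1: a portable implementation that stores records itself."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def record(self, message: str) -> None:
        self._records.append(message)

    def records(self) -> List[str]:
        return list(self._records)


class NativeTraceAdapter:
    """Approach 2: map the interface onto an existing platform trace log."""

    def __init__(self, native_log: Sequence[str]) -> None:
        self._native_log = native_log  # owned by the platform, not by this class

    def records(self) -> List[str]:
        return list(self._native_log)


class CompositeTrace:
    """Approach 3: a higher-level service composed from lower-level ones."""

    def __init__(self, parts: Sequence[TraceService]) -> None:
        self._parts = list(parts)

    def records(self) -> List[str]:
        merged: List[str] = []
        for part in self._parts:
            merged.extend(part.records())
        return merged


portable = ReferenceTrace()
portable.record("job 17 started")
platform_log = ["disk quota warning"]  # pretend the operating system wrote this
combined = CompositeTrace([portable, NativeTraceAdapter(platform_log)])
nested = CompositeTrace([combined, ReferenceTrace()])  # composition applies recursively
```

A caller holding `nested` cannot tell which records came from native facilities, which is the sense of virtualization used in this section.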

Central to this virtualization of resource behaviors is the ability to adapt to operating system 
functions on specific hosts. A significant challenge when developing these mappings is to enable 
exploitation of native capabilities (whether concerned with performance monitoring, workload 
management, problem determination, or enforcement of native platform security policy) so that 
the Grid environment does not become the least common denominator of its constituent pieces. 
Grid service discovery mechanisms are important in this regard, allowing higher-level services to 
discover what capabilities are supported by a particular implementation of an interface. For 
example, if a native platform supports reservation capabilities, an implementation of a resource 
management interface (e.g., GRAM [25, 31]) can exploit those capabilities. 

Thus, our service architecture supports local and remote transparency with respect to service 
location and invocation. It also provides for multiple protocol bindings to facilitate localized 
optimization of services invocation when the service is hosted locally with the service requestor, 
as well as to enable protocol negotiation for network flows across organizational boundaries 
where we may wish to choose between several InterGrid protocols, each optimized for a different 
purpose. Finally, we note that an implementation of a particular Grid service interface may map 
to native, nondistributed, platform functions and capabilities. 

4.2 Service Semantics: The Grid Service 

Our ability to virtualize and compose services depends on more than standard interface 
definitions. We also require standard semantics for service interactions so that, for example, 
different services follow the same conventions for error notification. To this end, OGSA defines 
what we call a Grid service: a Web service that provides a set of well-defined interfaces and that 
follows specific conventions. The interfaces address discovery, dynamic service creation, lifetime 
management, notification, and manageability; the conventions address naming and 
upgradeability. We expect also to address authorization and concurrency control as OGSA 
evolves. Two other important issues, authentication and reliable invocation, are viewed as service 
protocol bindings and are thus external to the core Grid service definition, but must be addressed 
within a complete OGSA implementation. This separation of concerns increases the generality of 
the architecture without compromising functionality. 

The interfaces and conventions that define a Grid service are concerned, in particular, with 
behaviors related to the management of transient service instances. VO participants typically 
maintain not merely a static set of persistent services that handle complex activity requests from 
clients. They often need to instantiate new transient service instances dynamically, which then 
handle the management and interactions associated with the state of particular requested 
activities. When the activity's state is no longer needed, the service can be destroyed. For 
example, in a videoconferencing system, the establishment of a videoconferencing session might 
involve the creation of service instances at intermediate points to manage end-to-end data flows 
according to QoS constraints. Or, in a Web serving environment, service instances might be 
instantiated dynamically to provide for consistent user response time by managing application 
workload through dynamically added capacity. Other examples of transient service instances 
might be a query against a database, a data mining operation, a network bandwidth allocation, a 
running data transfer, and an advance reservation for processing capability. (These examples 
emphasize that service instances can be extremely lightweight entities, created to manage even 
short-lived activities.) Transience has significant implications for how services are managed, 
named, discovered, and used. 
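
The lifecycle of a transient service instance can be sketched as a small factory that creates instances on demand, lets a client extend their termination time, and reclaims them when they are destroyed or expire. The Python names below loosely echo the operations proposed for OGSA (create, set termination time, destroy) but are hypothetical; this is not the OGSA interface definition, and the fake clock is an assumption for determinism.

```python
import itertools
from typing import Callable, Dict, List


class Factory:
    """Toy factory for transient service instances (hypothetical sketch)."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._termination: Dict[str, float] = {}
        self._ids = itertools.count(1)

    def create_service(self, lifetime: float) -> str:
        # Each new instance receives a fresh handle and an initial termination time.
        handle = f"instance-{next(self._ids)}"
        self._termination[handle] = self._clock() + lifetime
        return handle

    def set_termination_time(self, handle: str, when: float) -> None:
        if handle not in self.live_instances():
            raise KeyError(f"no live instance with handle {handle}")
        self._termination[handle] = when

    def destroy(self, handle: str) -> None:
        # Explicit destruction; destroying an unknown handle is a no-op.
        self._termination.pop(handle, None)

    def live_instances(self) -> List[str]:
        # Instances past their termination time are reclaimed here.
        now = self._clock()
        self._termination = {h: t for h, t in self._termination.items() if t > now}
        return sorted(self._termination)


fake_now = [0.0]
factory = Factory(clock=lambda: fake_now[0])
query = factory.create_service(lifetime=5.0)     # e.g., a short database query
transfer = factory.create_service(lifetime=5.0)  # e.g., a running data transfer
factory.set_termination_time(transfer, 60.0)     # the client still needs this one
fake_now[0] = 10.0
still_running = factory.live_instances()
```

After ten seconds only the transfer remains, because the query instance was never extended; a later `factory.destroy(transfer)` would remove it explicitly.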

4.2.1 Upgradeability Conventions and Transport Protocols 

Services within a complex distributed system must be independently upgradeable. Hence, 
versioning and compatibility between services must be managed and expressed so that clients can 
discover not only specific service versions but also compatible services. Further, services (and the 
hosting environments in which they run) must be upgradeable without disrupting the operation of 
their clients. For example, an upgrade to the hosting environment may change the set of network 
protocols that can be used to communicate with the service, and an upgrade to the service itself 
may correct errors or even enhance the interface. Hence, OGSA defines conventions that allow us 
to identify when a service changes and when those changes are backwardly compatible with 
respect to interface and semantics (but not necessarily network protocol). OGSA also defines 
mechanisms for refreshing a client's knowledge of a service, such as what operations it supports 
or what network protocols can be used to communicate with the service. A service's description 
indicates the protocol binding(s) that can be used to communicate with the service. Two 
properties will often be desirable in such bindings. 

• Reliable service invocation. Services interact with one another by the exchange of 
messages. In distributed systems prone to component failure, however, one can never 
guarantee that a message has been delivered. The existence of internal state makes it 
important to be able to guarantee that a service has received a message either once or not 
at all. From this foundation one can build a broad range of higher-level per-operation 
semantics, such as transactions. 

• Authentication. Authentication mechanisms allow the identity of individuals and services 
to be established for policy enforcement. Thus, one will often desire a transport protocol 
that provides for mutual authentication of client and service instance, as well as the 
delegation of proxy credentials. From this foundation one can build a broad range of 
higher-level authorization mechanisms. 

4.2.2 Standard Interfaces 

The interfaces (in WSDL terms, portTypes) that define a Grid service are listed in Table 1, 
introduced here, and described in more detail in Section 6 (and in [66]). Note that while OGSA 
defines a variety of behaviors and associated interfaces, all but one of these interfaces 
(GridService) are optional. 



Table 1: Proposed OGSA Grid service interfaces (see text for details). The names provided here will 
likely change in the future. Interfaces for authorization, policy management, manageability, and 
likely other purposes remain to be defined. 



PortType            Operation                     Description
------------------  ----------------------------  ------------------------------------------------
GridService         FindServiceData               Query a variety of information about the Grid
                                                  service instance, including basic introspection
                                                  information (handle, reference, primary key,
                                                  home handleMap: terms to be defined), richer
                                                  per-interface information, and service-specific
                                                  information (e.g., service instances known to a
                                                  registry). Extensible support for various query
                                                  languages.
                    SetTerminationTime            Set (and get) termination time for Grid service
                                                  instance
                    Destroy                       Terminate Grid service instance
NotificationSource  SubscribeToNotificationTopic  Subscribe to notifications of service-related
                                                  events, based on message type and interest
                                                  statement. Allows for delivery via third party
                                                  messaging services.
NotificationSink    DeliverNotification           Carry out asynchronous delivery of notification
                                                  messages
Registry            RegisterService               Conduct soft-state registration of Grid service
                                                  handles
                    UnregisterService             Deregister a Grid service handle
Factory             CreateService                 Create new Grid service instance
HandleMap           FindByHandle                  Return Grid Service Reference currently
                                                  associated with supplied Grid Service Handle



Discovery. Applications require mechanisms for discovering available services and for 
determining the characteristics of those services so that they can configure themselves and their 
requests to those services appropriately. We address this requirement by defining 

• a standard representation for service data, that is, information about Grid service 
instances, which we structure as a set of named and typed XML elements called service 
data, encapsulated in a standard container format; 

• a standard operation, FindServiceData (within the required GridService interface), for 
retrieving service data from individual Grid service instances ("pull" mode access; see 
the NotificationSource interface below for "push" mode access); and 

• standard interfaces for registering information about Grid service instances with registry 
services (Registry) and for mapping from "handles" to "references" (HandleMap, to be 
explained in Section 6, when we discuss naming). 

Dynamic service creation. The ability to dynamically create and manage new service instances is 
a basic tenet of the OGSA model and necessitates the existence of service creation services. The 
OGSA model defines a standard interface (Factory) and semantics that any service creation 
service must provide. 

Lifetime management. Any distributed system must be able to deal with inevitable failures. In a 
system that incorporates transient, stateful service instances, mechanisms must be provided for 
reclaiming services and state associated with failed operations. For example, termination of a 
videoconferencing session might also require the termination of services created at intermediate 
points to manage the flow. We address this requirement by defining two standard operations: 
Destroy and SetTerminationTime (within the required GridService interface), for explicit 
destruction and soft state lifetime management of Grid service instances, respectively. (Soft state 
protocols [59, 69] allow state established at a remote location to be discarded eventually, unless 
refreshed by a stream of subsequent "keepalive" messages. Such protocols have the advantages of 
being both resilient to failure (a single lost message need not cause irretrievable harm) and 
simple: no reliable "discard" protocol message is required.) 

Notification. A collection of dynamic, distributed services must be able to notify each other 
asynchronously of interesting changes to their state. OGSA defines common abstractions and 
service interfaces for subscription to (NotificationSource) and delivery of (NotificationSink) such 
notifications, so that services constructed by the composition of simpler services can deal with 
notifications (e.g., for errors) in standard ways. The NotificationSource interface is integrated 
with service data, so that a notification request is expressed as a request for subsequent "push" 
mode delivery of service data. (We might refer to the capabilities provided by these interfaces as 
an event service [1 0], but we avoid that term due to its overloaded meaning.) 
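
The subscription and delivery roles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OGSA interface definition: the method names are Python renderings of the operations in Table 1, the "interest statement" is reduced to the name of a single service data element, and delivery is a direct in-process call rather than an asynchronous message.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NotificationSink:
    """Receives delivered notification messages (hypothetical in-process sink)."""
    received: List[dict] = field(default_factory=list)

    def deliver_notification(self, message: dict) -> None:
        self.received.append(message)


@dataclass
class NotificationSource:
    """Holds named service data and pushes changes to subscribed sinks."""
    service_data: Dict[str, object] = field(default_factory=dict)
    subscriptions: Dict[str, List[NotificationSink]] = field(default_factory=dict)

    def subscribe_to_notification_topic(self, name: str, sink: NotificationSink) -> None:
        # Assumed "interest statement": the name of one service data element.
        self.subscriptions.setdefault(name, []).append(sink)

    def set_service_data(self, name: str, value: object) -> None:
        self.service_data[name] = value
        # "Push" mode: deliver the changed element to every interested sink.
        for sink in self.subscriptions.get(name, []):
            sink.deliver_notification({"name": name, "value": value})
```

A sink that subscribes to the element named "status" is told about later changes to "status" only; changes to other elements update the service data but produce no message for that sink.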

Other interfaces. We expect to define additional standard interfaces in the near future, to address 
issues such as authorization, policy management, concurrency control, and the monitoring and 
management of potentially large sets of Grid service instances. 

4.3 The Role of Hosting Environments 

OGSA defines the semantics of a Grid service instance: how it is created, how it is named, how 
its lifetime is determined, how to communicate with it, and so on. However, while OGSA is 
prescriptive on matters of basic behavior, it does not place requirements on what a service does or 
how it performs that service. In other words, OGSA does not address issues of implementation 
programming model, programming language, implementation tools, or execution environment. 

In practice, Grid services are instantiated within a specific execution environment or hosting 
environment. A particular hosting environment defines not only implementation programming 
model, programming language, development tools, and debugging tools, but also how an 
implementation of a Grid service meets its obligations with respect to Grid service semantics. 

Today's e-science Grid applications typically rely on native operating system processes as their 
hosting environment, with for example creation of a new service instance involving the creation 
of a new process. In such environments, a service itself may be implemented in a variety of 
languages such as C, C++, Java, or Fortran. Grid semantics may be implemented directly as part 
of the service, or provided via a linked library [39]. Typically semantics are not provided via 
external services, beyond those provided by the operating system. Thus, for example, lifetime 
management functions must be addressed within the application itself, if required. 

Web services, on the other hand, may be implemented on more sophisticated container or 
component-based hosting environments such as J2EE, Websphere, .NET, and Sun One. Such 
environments define a framework (container) within which components adhering to environment- 
defined interface standards can be instantiated and composed to build complex applications. 
Compared with the low levels of functionality provided by native hosting environments, 
container/component hosting environments tend to offer superior programmability, 
manageability, flexibility, and safety. Consequently, component/container based hosting 
environments are seeing widespread use for building e-business services. In the OGSA context, 
the container (hosting environment) has primary responsibility for ensuring that the services it 
supports adhere to Grid service semantics, and thus OGSA may motivate modifications or 
additions to the container/component interface. 

By defining service semantics, OGSA specifies interactions between services in a manner 
independent of any hosting environment. However, as the above discussion highlights, successful 
implementation of Grid services can be facilitated by specifying baseline characteristics that all 
hosting environments must possess, defining the "internal" interface from the service 
implementation to the global Grid environment. These characteristics would then be rendered into 
different implementation technologies (e.g., J2EE or shared libraries). 

A detailed discussion of hosting environment characteristics is beyond the scope of this article. 
However, we can expect a hosting environment to address mapping of Grid-wide names (i.e., 
Grid service handles) into implementation-specific entities (C pointers, Java object references, 
etc.); dispatch of Grid invocations and notification events into implementation-specific actions 
(events, procedure calls); protocol processing and the formatting of data for network 
transmission; lifetime management of Grid service instances; and inter-service authentication. 

4.4 Using OGSA Mechanisms to Build VO Structures 

Applications and users must be able to create transient services and to discover and determine the 
properties of available services. The OGSA Factory, Registry, GridService, and HandleMap 
interfaces support the creation of transient service instances and the discovery and 
characterization of the service instances associated with a VO. (In effect, a registry service — a 
service instance that supports the Registry interface for registration and the GridService 
interface's FindServiceData operation, with appropriate service data, for discovery — defines the 
service set associated with a VO.) These interfaces can be used to construct a variety of VO 
service structures, as illustrated in Figure 2 and described in the following. 

Simple hosting environment: A simple execution environment is a set of resources located within 
a single administrative domain and supporting native facilities for service management: for 
example, a J2EE application server, Microsoft .NET system, or Linux cluster. In OGSA, the user 
interface to such an environment will typically be structured as a registry, one or more factories, 
and a handleMap service. Each factory is recorded in the registry, to enable clients to discover 
available factories. When a factory receives a client request to create a Grid service instance, the 
factory invokes hosting-environment-specific capabilities to create the new instance, assigns it a 
handle, registers the instance with the registry, and makes the handle available to the handleMap 
service. The implementations of these various services map directly into local operations. 
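
The sequence of steps a factory performs in a simple hosting environment can be sketched as follows. This is a minimal in-memory illustration, not an OGSA implementation: the class and method names, the handle format, and the `create_local` callback (standing in for the hosting-environment-specific creation step) are all assumptions made for the example.

```python
import itertools
from typing import Callable, Dict


class Registry:
    """Records handles so that clients can discover factories and instances."""
    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}  # handle -> short description

    def register_service(self, handle: str, description: str) -> None:
        self.entries[handle] = description


class HandleMap:
    """Maps a handle to whatever the hosting environment uses as a reference."""
    def __init__(self) -> None:
        self._references: Dict[str, object] = {}

    def add_mapping(self, handle: str, reference: object) -> None:
        self._references[handle] = reference

    def find_by_handle(self, handle: str) -> object:
        return self._references[handle]


class Factory:
    def __init__(self, kind: str, create_local: Callable[[], object],
                 registry: Registry, handle_map: HandleMap) -> None:
        self._kind = kind
        self._create_local = create_local  # hosting-environment-specific creation
        self._registry = registry
        self._handle_map = handle_map
        self._ids = itertools.count(1)
        # Each factory is itself recorded in the registry.
        registry.register_service(f"factory/{kind}", f"factory for {kind}")

    def create_service(self) -> str:
        instance = self._create_local()                       # create the instance
        handle = f"instance/{self._kind}/{next(self._ids)}"   # hypothetical handle format
        self._registry.register_service(handle, self._kind)   # register with the registry
        self._handle_map.add_mapping(handle, instance)        # expose via the handleMap
        return handle
```

Because every step maps to a local operation, the whole environment here is three small objects sharing one process; a real hosting environment would substitute its own creation and naming facilities.
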

Figure 2: Three different VO structures, as described in the text. From left to right: simple hosting 
environment, virtual hosting environment, and collective services. 

Virtual hosting environment. In more complex environments, the resources associated with a VO 
will span heterogeneous, geographically distributed "hosting environments." (For example, in 
Figure 2, these resources span two simple hosting environments.) Nevertheless, this "virtual 
hosting environment" (which corresponds, perhaps, to the set of resources associated with a B2B 
partnership) can be made accessible to a client via exactly the same interfaces as were used for 
the hosting environment just described. We create one or more "higher-level" factories that 
delegate creation requests to lower-level factories. Similarly, we create a higher-level registry that 
knows about the higher-level factories and the service instances that they have created, as well 
as any VO-specific policies that govern the use of VO services. Clients can then use the VO registry 
to find factories and other service instances associated with the VO, and then use the handles 
returned by the registry to talk directly to those service instances. The higher-level factories and 
registry implement standard interfaces and so, from the perspective of the user, are 
indistinguishable from any other factory or registry. 
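
A higher-level factory of this kind can be sketched as an object that exposes the same creation operation as a lower-level factory and simply forwards each request. The sketch below is illustrative only: the class names, the handle strings, and the round-robin choice of lower-level factory are assumptions; the text does not prescribe how a higher-level factory selects where to delegate.

```python
from typing import List, Protocol


class Factory(Protocol):
    """The only thing a client sees: an operation that returns a handle."""
    def create_service(self) -> str: ...


class LocalFactory:
    """Stand-in for a factory inside one simple hosting environment."""
    def __init__(self, site: str) -> None:
        self.site = site
        self.count = 0

    def create_service(self) -> str:
        self.count += 1
        return f"{self.site}/instance/{self.count}"  # hypothetical handle format


class DelegatingFactory:
    """Higher-level factory: same interface, forwards to lower-level factories."""
    def __init__(self, lower_level: List[Factory]) -> None:
        if not lower_level:
            raise ValueError("at least one lower-level factory is required")
        self._lower_level = list(lower_level)
        self._next = 0

    def create_service(self) -> str:
        # Assumed policy: rotate through the lower-level factories in order.
        factory = self._lower_level[self._next % len(self._lower_level)]
        self._next += 1
        return factory.create_service()
```

Since `DelegatingFactory` satisfies the same `Factory` protocol, it can itself be placed beneath another delegating factory, which is what makes the virtual environment indistinguishable from a simple one to the client.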

Note that here, as in the previous example, the registry handle can be used as a globally unique 
name for the service set maintained by the VO. Resource management policies can be defined 
and enforced on the platforms hosting VO services, targeting the VO by this unique name. 

Collective operations. We can also construct a "virtual hosting environment" that provides VO 
participants with more sophisticated, virtual, "collective" or "end-to-end" services. In this case, 
the registry keeps track of and advertises factories that create higher-level service instances. Such 
instances are implemented by asking lower-level factories to create multiple service instances and 
by composing the behaviors of those multiple lower-level service instances into that single, 
higher-level service instance. 

These three examples, and the preceding discussion, illustrate how Grid service mechanisms can 
be used to integrate distributed resources both across virtual multi-organizational boundaries and 
within internal commercial IT infrastructures. In both cases, a collection of Grid services 
registered with appropriate discovery services can support functional capabilities delivering QoS 
interactions across distributed resource pools. Applications and middleware can exploit these 
services for distributed resource management across heterogeneous platforms with local and 
remote transparency and locally optimized flows. 

Implementations of Grid services that map to native platform resources and APIs enable seamless 
integration of higher-level Grid services such as those just described with underlying platform 
components. Furthermore, service sets associated with multiple VOs can map to the same 
underlying physical resources, with those services represented as logically distinct at one level 
but sharing physical resource systems at lower levels. 

5 Application Example 

Figure 3 shows the following stages in the life of a data mining computation, which we 
use to illustrate the working of basic remote service invocation, lifetime management, and 
notification functions. 

1. The environment initially comprises (from left to right) four simple hosting 
environments: one that runs the user application; one that encapsulates computing and 
storage resources (and that supports two factory services, one for creating storage 
reservations and the other for creating mining services); and two that encapsulate 
database services. The "R"s represent local registry services; an additional VO registry 
service presumably provides information about the location of all depicted services. 

2. The user application invokes "create Grid service" requests on the two factories in the 
second hosting environment, requesting the creation of a "data mining service" that will 
perform the data mining operation on its behalf, and an allocation of temporary storage 
for use by that computation. Each request involves mutual authentication of the user and 
the relevant factory (using an authentication mechanism described in the factory's service 
description) followed by authorization of the request. Each request is successful and 
results in the creation of a Grid service instance with some initial lifetime. The new data 
mining service instance is also provided with delegated proxy credentials that allow it to 
perform further remote operations on behalf of the user. 

3. The newly created data mining service uses its proxy credentials to start requesting data 
from the two database services, placing intermediate results in local storage. The data 
mining service also uses notification mechanisms to provide the user application with 
periodic updates on its status. Meanwhile, the user application generates periodic 
"keepalive" requests to the two Grid service instances that it has created. 

4. The user application fails for some reason. The data mining computation continues for 
now, but as no other party has an interest in its results, no further keepalive messages are 
generated. 

5. (Not shown in figure) Due to the application failure, keepalive messages cease, and so the 
two Grid service instances eventually time out and are terminated, freeing the storage and 
computing resources that they were consuming. 
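
Stages 3 through 5 can be replayed with a small discrete-time simulation. The sketch below is purely illustrative: the instance names, the 30-step lifetime, the 10-step keepalive period, and the integer clock are invented for the example, and a keepalive is modeled as a request to extend the termination time to "now plus lifetime".

```python
from typing import Dict, List


class SoftStateInstance:
    def __init__(self, name: str, now: int, lifetime: int) -> None:
        self.name = name
        self.termination_time = now + lifetime

    def keepalive(self, now: int, lifetime: int) -> None:
        # A keepalive only helps if it arrives before the instance has expired.
        if now < self.termination_time:
            self.termination_time = max(self.termination_time, now + lifetime)


def simulate(client_fails_at: int, lifetime: int = 30,
             keepalive_every: int = 10, horizon: int = 120) -> Dict[str, int]:
    """Return the time step at which each instance is reclaimed."""
    instances: List[SoftStateInstance] = [
        SoftStateInstance("mining", 0, lifetime),
        SoftStateInstance("storage", 0, lifetime),
    ]
    reclaimed: Dict[str, int] = {}
    for now in range(1, horizon + 1):
        client_alive = now < client_fails_at
        if client_alive and now % keepalive_every == 0:
            for instance in instances:
                instance.keepalive(now, lifetime)
        # The hosting environment reclaims any instance whose time has passed.
        for instance in instances:
            if instance.name not in reclaimed and now >= instance.termination_time:
                reclaimed[instance.name] = now
    return reclaimed
```

With these made-up numbers, a client that fails at step 45 sends its last keepalive at step 40, so both instances are reclaimed at step 70; a client that never fails keeps both instances alive for the whole run.
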

Figure 3: An example of Grid services at work. See text for details. 

6 Technical Details 

We now present a more detailed description of the Grid service abstraction and associated 
interfaces and conventions. 

6.1 The OGSA Service Model 

A basic premise of OGSA is that everything is represented by a service: a network enabled entity 
that provides some capability through the exchange of messages. Computational resources, 
storage resources, networks, programs, databases, and so forth are all services. This adoption of a 
uniform service-oriented model means that all components of the environment are virtual. 

More specifically, OGSA represents everything as a Grid service: a Web service that conforms to 
a set of conventions and supports standard interfaces for such purposes as lifetime management. 
This core set of consistent interfaces, from which all Grid services are implemented, facilitates 
the construction of higher-order services that can be treated in a uniform way across layers of 
abstraction. 

Grid services are characterized (typed) by the capabilities that they offer. A Grid service 
implements one or more interfaces, where each interface defines a set of operations that are 
invoked by exchanging a defined sequence of messages. Grid service interfaces correspond to 
portTypes in WSDL. The set of portTypes supported by a Grid service, along with some 
additional information relating to versioning, is specified in the Grid service's serviceType, a 
WSDL extensibility element defined by OGSA. 

Grid services can maintain internal state for the lifetime of the service. The existence of state 
distinguishes one instance of a service from another that provides the same interface. We use the 
term Grid service instance to refer to a particular instantiation of a Grid service. 

The protocol binding associated with a service interface can define a delivery semantics that 
addresses, for example, reliability. Services interact with one another by the exchange of 
messages. In distributed systems prone to component failure, however, one can never guarantee 
that a message that is sent has been delivered. The existence of internal state can make it 
important to be able to guarantee that a service has received a message once or not at all, even if 
failure recovery mechanisms such as retry are in use. In such situations, we may wish to use a 
protocol that guarantees exactly-once delivery or some similar semantics. Another frequently 
desirable protocol binding behavior is mutual authentication during communication. 
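
One common way to obtain "once or not at all" behavior in the presence of retries is for the receiver to remember the identifiers of messages it has already processed. The sketch below shows that technique only as an illustration; it is not a protocol binding defined by OGSA, and the `AccountService` class, its `deposit` operation, and the message ID scheme are hypothetical.

```python
from typing import Set


class AccountService:
    """A stateful service whose operations must not be applied twice."""
    def __init__(self) -> None:
        self.balance = 0
        self._seen_ids: Set[str] = set()

    def deposit(self, message_id: str, amount: int) -> bool:
        """Apply the deposit unless this message ID was already processed.

        Returns True if state changed, False if the message was a duplicate.
        """
        if message_id in self._seen_ids:
            return False  # duplicate caused by a retry: do not reapply
        self._seen_ids.add(message_id)
        self.balance += amount
        return True
```

A sender that is unsure whether a message arrived can resend it under the same ID; the internal state changes at most once no matter how many copies are delivered.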

OGSA services can be created and destroyed dynamically. Services may be destroyed explicitly, 
or may be destroyed or become inaccessible as a result of some system failure such as operating 
system crash or network partition. Interfaces are defined for managing service lifetime. 

Because Grid services are dynamic and stateful, we need a way to distinguish one dynamically 
created service instance from another. Thus, every Grid service instance is assigned a globally 
unique name, the Grid service handle (GSH), that distinguishes a specific Grid service instance 
from all other Grid service instances that have existed, exist now, or will exist in the future. (If a 
Grid service fails and is restarted in such a way as to preserve its state, then it is essentially the 
same instance, and the same GSH can be used.) 

Grid services may be upgraded during their lifetime, for example to support new protocol 
versions or to add alternative protocols. Thus, the GSH carries no protocol- or instance-specific 
information such as network address and supported protocol bindings. Instead, this information is 
encapsulated, along with all other instance-specific information required to interact with a 
specific service instance, into a single abstraction called a Grid service reference (GSR). Unlike a 
GSH, which is invariant, the GSR(s) for a Grid service instance can change over that service's 
lifetime. A GSR has an explicit expiration time, or may become invalid at any time during a 
service's lifetime, and OGSA defines mapping mechanisms, described below, for obtaining an 
updated GSR. 

The result of using a GSR whose lifetime has expired is undefined. Note that holding a valid GSR 
does not guarantee access to a Grid service instance: local policy or access control constraints (for 
example maximum number of current requests) may prohibit servicing a request. In addition, the 
referenced Grid service instance may have failed, preventing the use of the GSR. 
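
The division of labor between handle and reference can be made concrete with two small record types. This is a sketch under assumptions: the field names, the example addresses and binding labels, and the use of a numeric expiration time are invented for illustration and are not taken from the Grid service specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GridServiceHandle:
    """Invariant, globally unique name; carries no protocol or address details."""
    name: str


@dataclass(frozen=True)
class GridServiceReference:
    """Instance-specific contact information; may be replaced over time."""
    handle: GridServiceHandle
    address: str                 # hypothetical network address
    bindings: Tuple[str, ...]    # hypothetical protocol binding labels
    expires_at: float            # explicit expiration time

    def is_valid(self, now: float) -> bool:
        # A valid GSR still does not guarantee that the instance can be reached.
        return now < self.expires_at
```

After an upgrade that moves a service or changes its protocols, a new `GridServiceReference` is issued while the `GridServiceHandle` it points back to stays the same.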

As everything in OGSA is a Grid service, there must be Grid services that manipulate the Grid 
service, handle, and reference abstractions that define the OGSA model. Defining a specific set of 
services would result in a specific rendering of the OGSA service model. We therefore take a 
more flexible approach and define a set of basic OGSA interfaces (i.e., WSDL portTypes) for 
manipulating service model abstractions. These interfaces can then be combined in different ways 
to produce a rich range of Grid services. Table 1 presents names and descriptions for the Grid 
service interfaces defined to date. Note that only the GridService interface must be supported by 
all Grid services. 

6.2 Creating Transient Services: Factories 

OGSA defines a class of Grid services that implement an interface that creates new Grid service 
instances. We call this the Factory interface and a service that implements this interface a factory. 
The Factory interface's CreateService operation creates a requested Grid service and returns the 
GSH and initial GSR for the new service instance. 

The Factory interface does not specify how the service instance is created. One common scenario 
is for the factory interface to be implemented in some form of hosting environment (such as .NET 
or J2EE) that provides standard mechanisms for creating (and subsequently managing) new 
service instances. The hosting environment may define how services are implemented (e.g., 
language), but this is transparent to service requestors in OGSA, which see only the factory 
interface. Alternatively, one can construct "higher-level" factories that create services by 
delegating the request to other factory services (see Section 4.4). For example, in a Web serving 
environment, a new computer might be integrated into the active pool by asking an appropriate 
factory service to instantiate a "Web serving" service on an idle computer. 

6.3 Service Lifetime Management 

The introduction of transient service instances raises the issue of determining the service's 
lifetime: that is, determining when a service can or should be terminated so that associated 
resources can be recovered. In normal operating conditions, a transient service instance is created 
to perform a specific task and either terminates on completion of this task or via an explicit 
request from the requestor or from another service designated by the requestor. In distributed 
systems, however, components may fail and messages may be lost. One result is that a service 
may never see an expected explicit termination request, thus causing it to consume resources 
indefinitely. 

OGSA addresses this problem through a soft state approach [23, 69] in which Grid service 
instances are created with a specified lifetime. The initial lifetime can be extended by a specified 
time period by explicit request of the client or another Grid service acting on the client's behalf 
(subject of course to the policy of the service). If that time period expires without having received 
a re-affirmation of interest from a client, either the hosting environment or the service instance 
itself is at liberty to terminate the service instance and release any associated resources. 

Our approach to Grid service lifetime management has two desirable properties: 

• A client knows, or can determine, when a Grid service instance will terminate. This 
knowledge allows the client to determine reliably when a service instance has terminated 
and hence its resources have been recovered, even in the face of system faults (e.g., 
failures of servers, networks, clients). The client knows exactly how long it has in order 
to request a final status from the service instance or to request an extension to the 
service's lifetime. Moreover, it also knows that if system faults occur, it need not 
continue attempting to contact a service after a known termination time, and that any 
resources associated with that service would be released after that time (unless another 
client succeeded in extending the lifetime). In brief, lifetime management enables robust 
termination and failure detection, by clearly defining the lifetime semantics of a service 
instance. 

• A hosting environment is guaranteed that resource consumption is bounded, even in the 
face of system failures outside of its control. If the termination time of a service is 
reached, the hosting environment can reclaim all associated resources. 

We implement soft state lifetime management via the SetTerminationTime operation within the 
required GridService interface, which defines operations for negotiating an initial lifetime for a 
new service instance, for requesting a lifetime extension, and for harvesting a service instance 
when its lifetime has expired. We describe each of these mechanisms in turn. 

Negotiating an initial lifetime. When requesting the creation of a new Grid service instance 
through a factory, a client indicates minimum and maximum acceptable initial lifetimes. The 
factory selects an initial lifetime and returns this to the client. 

Requesting a lifetime extension. A client requests a lifetime extension via a SetTerminationTime 
message to the Grid service instance, which specifies a minimum and maximum acceptable new 
lifetime. The service instance selects a new lifetime and returns this to the client. Note that a 
keepalive message is effectively idempotent: the result of a sequence of requests is the same, even 
if intermediate requests are lost or reordered, as long as not so many requests are lost that the 
service instance's lifetime expires. 
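
The selection step in this negotiation can be sketched as a single function. The policy shown is an assumption chosen for illustration: the service grants the latest time the client will accept, capped by a service-side maximum lifetime, and declines (returns `None`) if that cap falls short of the client's minimum. The function and parameter names are hypothetical, and times are plain numbers on a shared clock.

```python
from typing import Optional


def select_termination_time(now: float, earliest: float, latest: float,
                            max_lifetime: float) -> Optional[float]:
    """Pick a termination time within [earliest, latest], or None to decline.

    earliest/latest are the client's minimum and maximum acceptable absolute
    termination times; max_lifetime is an assumed service-side policy limit.
    """
    if earliest > latest:
        raise ValueError("earliest must not exceed latest")
    granted = min(latest, now + max_lifetime)
    return granted if granted >= earliest else None
```

For example, a request made at time 0 for a termination time between 60 and 600 against a 300-unit policy cap yields 300. Whenever the cap is not the limiting factor, a repeated copy of the same request selects the same absolute time, which is consistent with the idempotence noted above.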

The periodicity of keepalive messages can be determined by the client based on the initial 
lifetime negotiated with the service instance (and perhaps renegotiated via subsequent keepalive 
messages) and knowledge about network reliability. The interval size allows tradeoffs between 
currency of information and overhead. 
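
As a rough worked example of this tradeoff, suppose (as an assumption, not a rule from the specification) that each successful keepalive moves the termination time to one full lifetime L after its arrival, and that the client wants the instance to survive k consecutive lost keepalives. The (k+1)-th message after the last successful one must then arrive before L has elapsed, so the sending period T should satisfy (k+1)T + d < L, where d is an allowance for network delay. The hypothetical helper below computes the largest such period for a chosen margin.

```python
def keepalive_interval(lifetime: float, tolerated_losses: int,
                       margin: float = 0.0) -> float:
    """Upper bound on the keepalive period under the assumptions stated above.

    lifetime: extension granted by each successful keepalive
    tolerated_losses: number of consecutive lost keepalives to survive
    margin: allowance for network delay and clock skew
    """
    if tolerated_losses < 0 or margin < 0 or margin >= lifetime:
        raise ValueError("invalid parameters")
    return (lifetime - margin) / (tolerated_losses + 1)
```

With a 60-second lifetime, a 6-second margin, and tolerance for two consecutive losses, the client would send a keepalive at most every 18 seconds; tolerating more losses shortens the period and raises overhead accordingly.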

We note that this approach to lifetime management provides a service with considerable 
autonomy. Lifetime extension requests from clients are not mandatory: the service can apply its 
own policies on granting such requests. A service can decide at any time to extend its lifetime, 
either in response to a lifetime extension request by a client or for any other reason. A service 
instance can also cancel itself at any time, for example if resource constraints and priorities 
dictate that it relinquishes its resources. Subsequent client requests that refer to this service will 
fail. 

The use of absolute time in the SetTerminationTime operation (and, for that matter, in Grid 
service information elements, and commonly in security credentials) implies the existence of a 
global clock that is sufficiently well synchronized. The Network Time Protocol (NTP) provides 
standardized mechanisms for clock synchronization and can typically synchronize clocks within 
at most tens of milliseconds, which is more than adequate for the purposes of lifetime 
management. Note that we are not implying by these statements a requirement for ordering of 
events, although we expect to introduce some such mechanisms in future revisions. 

6.4 Managing Handles and References 

As discussed above, the result of a factory request is a GSH and a GSR. While the GSH is 
guaranteed to reference the created Grid service instance in perpetuity, the GSR is created with a 
finite lifetime and may change during the service's lifetime. While this strategy has the advantage 
of increased flexibility from the perspective of the Grid service provider, it introduces the 
problem of obtaining a valid GSR once the GSR returned by the service creation operation 
expires. At its core, this is a bootstrapping problem: how does one establish communication with 
a Grid service given only its GSH? We describe here how these issues are addressed in the Grid 
service specification as of June 2002, but note that this part of the specification is likely to evolve 
in the future, at a minimum to support multiple handle representations and handle mapping 
services. 

The approach taken in OGSA is to define a handle-to-reference mapper interface (HandleMap). 
The operations provided by this interface take a GSH and return a valid GSR. Mapping 
operations can be access controlled and thus a mapping request may be denied. An 
implementation of the HandleMap interface may wish to keep track of what Grid service 
instances are actually in existence and not return references to instances that it knows have 
terminated. However, possession of a valid GSR does not assure that a Grid service instance can 
be contacted: the service may have failed or been explicitly terminated between the time the GSR 
was given out and the time that it was used. (Obviously, if termination of a service is scheduled, it 
is desirable to represent this in the GSR lifetime, but it is not required.) 

By introducing the HandleMap interface, we partition the general problem of obtaining a GSR for 
an arbitrary service into two more specific subproblems: 

1) Identifying a handleMap service that contains the mapping for the specified GSH, and 

2) Contacting that handleMap to obtain the desired GSR. 

We address these two subproblems in turn. To ensure that we can always map a GSH to a GSR, 
we require that every Grid service instance be registered with at least one handleMap, which we 
call the home handleMap. By structuring the GSH to include the home handleMap's identity, we 
can easily and scalably determine which handleMap to contact to obtain a GSR for a given GSH. 
Hence, unique names can be determined locally, thus avoiding scalability problems associated 
with centralized name allocation services (although relying on the Domain Name System [52]). 
Note that GSH mappings can also live in other handleMaps. However, every GSH must have 
exactly one home handleMap. 

How do we identify the home handleMap within a GSH? Any service that implements the 
HandleMap interface is a Grid service, and as such will have a GSH. If we use this name in 
constructing a GSH, however, then we are back in the same position of trying to obtain a GSR 
from the handleMap service's GSH. To resolve this bootstrapping problem, we need a way to 
obtain the GSR for the handleMap without requiring a handleMap! We accomplish this by 
requiring that all home handleMap services be identified by a URL and support a bootstrapping 
operation that is bound to a single, well-known protocol, namely, HTTP (or HTTPS). Hence, 
instead of using a GSR to describe what protocols should be used to contact the handleMap 
service, an HTTP GET operation is used on the URL that points to the home handleMap, and the 
GSR for the handleMap, in WSDL form, is returned. 
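
The two-step resolution described above can be sketched as follows. This is only an illustration under stated assumptions, not the OGSA wire format: the GSH layout (home handleMap URL, then `#`, then an instance identifier), the function names, and the injected `http_get` and `lookup` callables are all hypothetical and chosen for the example.

```python
from typing import Callable, Dict, Tuple


def split_gsh(gsh: str) -> Tuple[str, str]:
    """Split a GSH into (home handleMap URL, instance id).

    Assumed layout for this sketch only: "<handleMap URL>#<instance id>".
    """
    url, sep, instance_id = gsh.partition("#")
    if not sep or not instance_id or not url.startswith(("http://", "https://")):
        raise ValueError(f"not a GSH in the assumed format: {gsh!r}")
    return url, instance_id


def resolve(gsh: str,
            http_get: Callable[[str], str],
            lookup: Callable[[str, str], str]) -> str:
    """Map a GSH to a GSR via its home handleMap.

    http_get(url) stands in for the bootstrap HTTP GET that returns the
    handleMap's own GSR (WSDL text); lookup(handlemap_gsr, gsh) stands in
    for invoking the handleMap through that GSR.
    """
    handlemap_url, _ = split_gsh(gsh)
    handlemap_gsr = http_get(handlemap_url)  # bootstrap: no handleMap required
    return lookup(handlemap_gsr, gsh)        # ask the home handleMap for the GSR


# In-memory stand-ins so the sketch runs without a network (hypothetical data).
_WSDL: Dict[str, str] = {
    "https://hm.example.org/handlemap": "<definitions name='HandleMap'/>",
}
_MAPPINGS: Dict[str, str] = {
    "https://hm.example.org/handlemap#job-17": "<definitions name='JobService'/>",
}


def fake_http_get(url: str) -> str:
    return _WSDL[url]


def fake_lookup(handlemap_gsr: str, gsh: str) -> str:
    return _MAPPINGS[gsh]
```

Because the handleMap URL is carried inside the handle, the client needs no prior GSR for the handleMap itself; the single well-known protocol breaks the circular dependency.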

Note that a relationship exists between services that implement the HandleMap and Factory 
interfaces. Specifically, the GSH returned by a factory request must contain the URL of the home 
handleMap, and the GSH/GSR mapping must be entered and updated into the handleMap service. 
The implementation of a factory must decide what service to use as the home handleMap. Indeed, 
a single service may implement both the Factory and HandleMap interfaces. 

Current work within GGF is revising this Grid service component to allow for other forms of 
handles and mappers/resolvers and/or to simplify the current handle and resolver. 

6.5 Service Data and Service Discovery 

Associated with each Grid service instance is a set of service data, a collection of XML elements 
encapsulated as service data elements. The packaging of each element includes a name that is 
unique to the Grid service instance, a type, and time-to-live information that a recipient can use 
for lifetime management. 

The obligatory GridService interface defines a standard WSDL operation, FindServiceData, for 
querying and retrieving service data. This operation requires a simple "by name" query language, 
and is extensible to allow for the specification of the query language used, which may be, for 
example, XQuery [20]. 
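
As a rough sketch of the "by name" query model, the following keeps service data elements in memory and looks them up by name. The class and method names are hypothetical, the XML payload is held as plain text, and the time-to-live is modeled as an issue time plus a duration in seconds; none of this is taken from the Grid service specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ServiceDataElement:
    name: str         # unique within one Grid service instance
    type: str         # type label for the XML payload
    xml: str          # the encapsulated XML element, kept as text here
    issued_at: float  # seconds; when the value was produced
    ttl: float        # seconds the recipient may treat the value as current

    def is_current(self, now: float) -> bool:
        """True while the recipient may still rely on this value."""
        return now < self.issued_at + self.ttl


class ServiceData:
    """Holds one instance's service data and answers by-name queries."""

    def __init__(self) -> None:
        self._elements: Dict[str, ServiceDataElement] = {}

    def publish(self, element: ServiceDataElement) -> None:
        # Names are unique per instance, so a new value replaces the old one.
        self._elements[element.name] = element

    def find_by_name(self, name: str) -> Optional[ServiceDataElement]:
        return self._elements.get(name)
```

A richer query language would replace `find_by_name` with an evaluator over the stored XML; the storage model would not need to change.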

The Grid service specification defines for each Grid service interface a set of zero or more service 
data elements that must be supported by any Grid service instance that supports that interface. 
Associated with the GridService interface, and thus obligatory for any Grid service instance, are a 
set of elements containing basic information about a Grid service instance, such as its GSH, GSR, 
primary key, and home handleMap. 

One application of the GridService interface's FindServiceData operation is service discovery. 
Our discussion above assumed that one has a GSH that represents a desired service. But how does 
one obtain the GSH in the first place? This is the essence of service discovery, which we define 
here as the process of identifying a subset of GSHs from a specified set based on GSH attributes 
such as the interfaces provided, the number of requests that have been serviced, the load on the 
service, or policy statements such as the number of outstanding requests allowed. 

A Grid service that supports service discovery is called a registry. A registry service is defined by 
two things: the Registry interface, which provides operations by which GSHs can be registered 
with the registry service, and an associated service data element used to contain information 
about registered GSHs. Thus, the Registry interface is used to register a GSH and the GridService 
interface's FindServiceData operation is used to retrieve information about registered GSHs. 

The Registry interface allows a GSH to register with a registry service to augment the set of 
GSHs that are considered for subsetting. As in MDS-2 [24], a service (or VO) can use this 
operation to notify interested parties within a VO of its existence and the service(s) that it 
provides. These interested parties typically include various forms of service discovery services, 
which collect and structure service information in order to respond efficiently to service 
discovery requests. As with other stateful interfaces in OGSA, GSH registration is a soft state 
operation and must be periodically refreshed, thus allowing discovery services to deal naturally 
with dynamic service availability. 
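
Soft-state registration can be illustrated with a small in-memory registry in which an entry disappears unless it is refreshed. The class name, the fixed lifetime, and the explicit `now` argument (used instead of a real clock so the behavior is easy to follow) are assumptions of this sketch, not part of the Registry interface.

```python
from typing import Dict, List


class SoftStateRegistry:
    """Registry whose entries expire unless periodically refreshed."""

    def __init__(self, lifetime: float) -> None:
        self.lifetime = lifetime                  # seconds an entry stays valid
        self._expires_at: Dict[str, float] = {}   # GSH -> expiry time

    def register(self, gsh: str, now: float) -> None:
        """Register or refresh a GSH; both push the expiry time forward."""
        self._expires_at[gsh] = now + self.lifetime

    def live(self, now: float) -> List[str]:
        """Drop expired entries, then return the GSHs still registered."""
        self._expires_at = {g: t for g, t in self._expires_at.items() if t > now}
        return sorted(self._expires_at)
```

No explicit "unregister" call is needed: a service that fails or is terminated simply stops refreshing and ages out of the registry.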

We note that specification of the attributes associated with a GSH is not tied to the registration of 
a GSH to a service implementing the GridService interface. This feature is important because 
attribute values may be dynamic and there may be a variety of ways in which attribute values 
may be obtained, including consulting another service implementing the GridService interface. 



6.6 Notification 

The OGSA notification framework allows clients to register interest in being notified of particular 
messages (the NotificationSource interface) and supports asynchronous, one-way delivery of such 
notifications (NotificationSink). If a particular service wishes to support subscription of 
notification messages, it must support the NotificationSource interface to manage the 
subscriptions. A service that wishes to receive notification messages must implement the 
NotificationSink interface, which is used to deliver notification messages. To start notification 
from a particular service, one invokes the subscribe operation on the notification source interface, 
giving it the service GSH of the notification sink. A stream of notification messages then flows 
from the source to the sink, while the sink sends periodic keepalive messages to notify the source 
that it is still interested in receiving notifications. If reliable delivery is desired, this behavior can 
be implemented by defining an appropriate protocol binding for this service. 
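
The subscribe, deliver, and keepalive interaction can be sketched as below. This is a simplified in-process model under assumed names: a sink is represented by a plain callable, time is passed in explicitly, and a subscription is dropped once no keepalive has arrived within a configurable window. Delivery is best-effort, as in the text.

```python
from typing import Callable, Dict

Sink = Callable[[str], None]  # stands in for a sink's delivery operation


class NotificationSource:
    """Manages subscriptions and pushes messages to interested sinks."""

    def __init__(self, keepalive_window: float) -> None:
        self.keepalive_window = keepalive_window
        self._sinks: Dict[str, Sink] = {}
        self._last_seen: Dict[str, float] = {}

    def subscribe(self, sink_gsh: str, sink: Sink, now: float) -> None:
        self._sinks[sink_gsh] = sink
        self._last_seen[sink_gsh] = now

    def keepalive(self, sink_gsh: str, now: float) -> None:
        # Only known subscribers can extend their subscription.
        if sink_gsh in self._sinks:
            self._last_seen[sink_gsh] = now

    def notify(self, message: str, now: float) -> int:
        """Deliver one-way to every sink still within its keepalive window."""
        delivered = 0
        for gsh in list(self._sinks):
            if now - self._last_seen[gsh] > self.keepalive_window:
                # The sink stopped sending keepalives: the subscription lapses.
                del self._sinks[gsh], self._last_seen[gsh]
                continue
            self._sinks[gsh](message)
            delivered += 1
        return delivered
```

The source never waits for an acknowledgement; any stronger delivery guarantee would come from the protocol binding rather than from this logic.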

An important aspect of this notification model is a close integration with service data: a 
subscription operation is just a request for subsequent "push" delivery of service data that meet 
specified conditions. (Recall that the FindServiceData operation provides a "pull" model.) 

The framework allows both for direct service-to-service notification message delivery, and for 
integration with various third-party services, such as messaging services commonly used in the 
commercial world, or custom services that filter, transform, or specially deliver notification 
messages on behalf of the notification source. Notification semantics are a property of the 
protocol binding used to deliver the message. For example, a SOAP/HTTP protocol or direct 
UDP binding would provide point-to-point, best-effort notification, while other bindings (e.g., 
some proprietary message service) would provide better than best-effort delivery. A multicast 
protocol binding would support multiple receivers. 

6.7 Change Management 

In order to support discovery and change management of Grid services, Grid service interfaces 
must be globally and uniquely named. In WSDL, an interface is defined by a portType and is 
globally and uniquely named by the portType's qname (i.e., an XML namespace as defined by 
the targetNamespace attribute in the WSDL document's definitions element, and a local name 
defined by the portType element's name attribute). Any changes made to the definition of a Grid 
service, either by changing its interface or by making semantically significant implementation 
changes to the operations, must be reflected through new interface names (i.e., new portTypes 
and/or serviceTypes). This feature allows clients that require Grid Services with particular 
properties (either particular interfaces or implementation semantics) to discover compatible 
services. 
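
The naming rule can be illustrated with a small model of qualified names. The namespaces and portType name below are invented for the example, and the compatibility check is a simple set comparison assumed for illustration; the point is only that a changed interface receives a new qname, so a client asking for the old one never silently binds to the new one.

```python
from typing import FrozenSet, NamedTuple


class QName(NamedTuple):
    namespace: str   # from the WSDL targetNamespace
    local_name: str  # from the portType element's name attribute


def compatible(offered: FrozenSet[QName], required: FrozenSet[QName]) -> bool:
    """A service is compatible if it offers every interface the client names."""
    return required <= offered


# Hypothetical interface names: same local name, different namespaces,
# so the revised interface is a distinct globally unique name.
V1 = QName("http://example.org/counter/v1", "CounterPortType")
V2 = QName("http://example.org/counter/v2", "CounterPortType")
```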

6.8 Other Interfaces 

We expect in the future to define an optional Manageability interface that supports a set of 
manageability operations. Such operations allow potentially large sets of Grid service instances to 
be monitored and managed from management consoles, automation tools, and the like. An 
optional Concurrency interface will provide concurrency control operations. 

7 Network Protocol Bindings 

The Web services framework can be instantiated on a variety of different protocol bindings. 
SOAP+HTTP with TLS for security is one example, but others can and have been defined. Here 
we discuss some issues that arise in the OGSA context. 

In selecting network protocol bindings within an OGSA context, we must address four primary 
requirements: 

• Reliable transport. As discussed above, the Grid services abstraction can require support 
for reliable service invocation. One way to address this requirement is to incorporate 
appropriate support within the network protocol binding, as for example in HTTP-R. 

• Authentication and delegation. As discussed above, the Grid services abstraction can 
require support for communication of proxy credentials to remote sites. One way to 
address this requirement is to incorporate appropriate support within the network protocol 
binding, as for example in TLS extended with proxy credential support. 

• Ubiquity. The Grid goal of enabling the dynamic formation of VOs from distributed 
resources means that, in principle, it must be possible for any arbitrary pair of services to 
interact. 

• GSR format. Recall that the Grid Service Reference can take a binding-specific format. 
One possible GSR format is a WSDL document; CORBA IOR is another. 

The successful deployment of large-scale interoperable OGSA implementations would benefit 
from the definition of a small number of standard protocol bindings for Grid service discovery 
and invocation. Just as the ubiquitous deployment of the Internet Protocol allows essentially any 
two entities to communicate, so ubiquitous deployment of such "InterGrid" protocols will allow 
any two services to communicate. Hence, clients can be particularly simple, since they need to 
know about only one set of protocols. (Notice that the definition of such standard protocols does 
not prevent a pair of services from using an alternative protocol, if both support it.) Whether or 
not such InterGrid protocols can be defined and gain widespread acceptance remains to be seen. 
In any case, their definition is beyond the scope of this article. 

8 Higher-Level Services 

The abstractions and services described in this article provide building blocks that can be used to 
implement a variety of higher-level Grid services. We intend to work closely with the community 
to define and implement a wide variety of such services that will, collectively, address the diverse 
requirements of e-business and e-science applications. These are likely to include the following: 

• Distributed data management services, supporting access to and manipulation of 
distributed data, whether in databases or files [58]. Services of interest include database 
access, data translation, replica management, replica location, and transactions. 

• Workflow services, supporting the coordinated execution of multiple application tasks on 
multiple distributed Grid resources. 

• Auditing services, supporting the recording of usage data, secure storage of that data, 
analysis of that data for purposes of fraud and intrusion detection, and so forth. 

• Instrumentation and monitoring services, supporting the discovery of "sensors" in a 
distributed environment, the collection and analysis of information from these sensors, 
the generation of alerts when unusual conditions are detected, and so forth. 

• Problem determination services for distributed computing, including dump, trace, and log 
mechanisms with event tagging and correlation capabilities. 

• Security protocol mapping services, enabling distributed security protocols to be 
transparently mapped onto native platform security services for participation by platform 
resource managers not implemented to support the distributed security authentication and 
access control mechanism. 

The flexibility of our framework means that such services can be implemented and composed in a 
variety of different ways. For example, a coordination service that supports the simultaneous 
allocation and use of multiple computational resources can be instantiated as a service instance, 
linked with an application as a library, or incorporated into yet higher-level services. 

It appears straightforward to re-engineer the resource management, data transfer, and information 
service protocols used within the current Globus Toolkit to build on these common mechanisms 
(see Figure 4). In effect, we can refactor the design of those protocols, extracting similar elements 
to exploit commonalities. In the process, we enhance the capabilities of the current protocols and 
arrive at a common service infrastructure. This process will produce Globus Toolkit 3.0. 



[Figure 4 diagram labels: GRAM, GridFTP, MDS (left); GRAM, GridFTP, MDS (right)] 

Figure 4: On the left, some current Globus Toolkit protocols; on the right, a potential 
refactoring to exploit OGSA mechanisms. 

9 Related Work 

We note briefly some relevant prior and other related work, focusing in particular on issues 
relating to the secure and reliable remote creation and management of transient, stateful services. 

As discussed in Section 3.1, many OGSA mechanisms derive from the Globus Toolkit v2.0: in 
particular, the factory (GRAM gatekeeper [25]), registry (GRAM reporter [25] and MDS-2 [24]), 
use of soft-state registration (MDS-2 [24]), secure remote invocation with delegation (GSI [33]), 
and reliable remote invocation (GRAM [25]). The primary differences relate to how these 
different mechanisms are integrated, with OGSA refactoring key design elements so that, for 
example, common notification mechanisms are used for service registration and service state. 

OGSA can be viewed as a distributed object system [21], in the sense that each Grid service 
instance has a unique identity with respect to the other instances in the system, and each instance 
can be characterized as state coupled with behavior published through type-specific operations. In 
this respect, OGSA exploits ideas developed previously in systems such as Eden [7], Argus [48], 
CORBA [1], SOS [60], Spring [51], Globe [62], Mentat [42], and Legion [41, 43], among others. 
In contrast to CORBA, OGSA, like Web services, addresses directly issues of secure 
interoperability and provides a richer interface definition language. In Grid computing, the 
Legion group has promoted the use of object models, and we can draw parallels between certain 
OGSA and Legion constructs, in particular the factory ("Class Object"), handleMap ("Binding 
Agent"), and timeouts on bindings. However, we also note that OGSA is nonprescriptive on 
several issues that are often viewed as central to distributed object systems, such as the use of 
object technologies in implementations, the exposure of inheritance mechanisms at the interface 
level, and hosting technology. 

Soft state mechanisms have been used for management of specific state in network entities within 
Internet protocols [23, 61, 69] and (under the name "leases") in RMI and Jini [57]. In OGSA, all 
services and information are open to soft state management. We prefer soft state techniques to 
alternatives such as distributed reference counting [12] because of their relative simplicity. 
Our reliable invocation mechanisms are inspired by those used in Condor [36, 49, 50], which in 
turn build on much prior work in distributed systems. 

As noted in Section 4.3, core OGSA service behaviors will, in general, be supported via some 
form of hosting environment that simplifies the development of individual components by 
managing persistence, security, lifecycle management, and so forth. The notion of a hosting 
environment appears in various operating systems and object systems. 

The application of Web services mechanisms to Grid computing has also been investigated and 
advocated by others (e.g., [35, 37]), with a recent workshop providing overviews of a number of 
relevant efforts [2]. Gannon et al. [37] discuss the application of various contemporary 
technologies to e-science applications and propose "application factories" (with WSDL 
interfaces) as a means of creating application services dynamically. De Roure et al. [26] propose 
a "Semantic Grid," by analogy to the Semantic Web [11], and propose a range of higher-level 
services. Work on service-oriented interfaces to numerical software in NetSolve [16, 17] and Ninf 
[55] is also relevant. 

Sun Microsystems' JXTA system [3] addresses several important issues encountered in Grids, 
including discovery of, and membership in, virtual organizations, which JXTA calls "peer 
groups." We believe that these abstractions can be implemented within the OGSA framework. 
There are connections to be made with component models for distributed and high-performance 
computing [8, 13, 68], some implementations of which build on Globus Toolkit mechanisms. 

10 Summary 

We have defined an Open Grid Services Architecture (OGSA) that supports, via standard 
interfaces and conventions, the creation, termination, management, and invocation of stateful, 
transient services as named, managed entities with dynamic, managed lifetime. 

Within OGSA, everything is represented as a Grid service, that is, a (potentially transient) service 
that conforms to a set of conventions (expressed using WSDL) for such purposes as lifetime 
management, discovery of characteristics, notification, and so on. Grid service implementations 
can target native platform facilities for integration with, and of, existing IT infrastructures. 
Standard interfaces for creating, registering, and discovering Grid services can be configured to 
create various forms of VO structure. 

The merits of this service-oriented model are as follows. All components of the environment are 
virtualized. By providing a core set of consistent interfaces from which all Grid services are 
implemented, we facilitate the construction of hierarchical, higher-order services that can be treated 
in a uniform way across layers of abstraction. Virtualization also enables mapping of multiple 
logical resource instances onto the same physical resource, composition of services regardless of 
implementation, and management of resources within a VO based on composition from lower- 
level resources. It is virtualization of Grid services that underpins the ability for mapping 
common service semantic behavior seamlessly onto native platform facilities. 

The development of OGSA represents a natural evolution of the Globus Toolkit 2.0, in which the 
key concepts of factory, registry, reliable and secure invocation, etc., exist, but in a less general 
and flexible form than here, and without the benefits of a uniform interface definition language. 
In effect, OGSA refactors key design elements so that, for example, common notification 
mechanisms are used for service registration and service state. OGSA also further abstracts these 
elements so that they can be applied at any level to virtualize VO resources. The Globus Toolkit 
provides the basis for an open source OGSA implementation, Globus Toolkit 3.0, that supports 
existing Globus APIs as well as WSDL interfaces, as described at www.globus.org/ogsa. 

The development of OGSA also represents a natural evolution of Web services. By integrating 
support for transient, stateful service instances with existing Web services technologies, OGSA 
extends significantly the power of the Web services framework, while requiring only minor 
extensions to existing technologies. 

Abstract 

In recent years, there has been a dramatic increase in the 
amount of available computing and storage resources. 
Yet few have been able to exploit these resources in an 
aggregated form. We present the Condor-G system, which 
leverages software from Globus and Condor to allow 
users to harness multi-domain resources as if they all 
belong to one personal domain. We describe the structure 
of Condor-G and how it handles job management, 
resource selection, security, and fault tolerance. 



1. Introduction 

In recent years the scientific community has 
experienced a dramatic pluralization of computing and 
storage resources. The national high-end computing 
centers have been joined by an ever-increasing number of 
powerful regional and local computing environments. The 
aggregated capacity of these new computing resources is 
enormous. Yet, to date, few scientists and engineers have 
managed to exploit the aggregate power of this seemingly 
infinite Grid of resources. While in principle most users 
could access resources at multiple locations, in practice 
few reach beyond their home institution, whose resources 
are often far from sufficient for increasingly demanding 
computational tasks such as simulation, large-scale 
optimization, Monte Carlo computing, image processing, 
and rendering. The problem is the significant "potential 
barrier" associated with the diverse mechanisms, policies, 
failure modes, performance uncertainties, etc., that 
inevitably arise when we cross the boundaries of 
administrative domains. 

Overcoming this potential barrier requires new 
methods and mechanisms that meet the following three 
key user requirements for computing in a "Grid" that 
comprises resources at multiple locations: 
• They want to be able to discover, acquire, and 
reliably manage computational resources 
dynamically, in the course of their everyday 
activities. 

• They do not want to be bothered with the location 
of these resources, the mechanisms that are 
required to use them, with keeping track of the 
status of computational tasks operating on these 
resources, or with reacting to failure. 

• They do care about how long their tasks are likely 
to run and how much these tasks will cost. 

In this article, we present an innovative distributed 
computing framework that addresses these three issues. 
The Condor-G system leverages the significant advances 
that have been achieved in recent years in two distinct 
areas: (1) security, resource discovery, and resource 
access in multi-domain environments, as supported within 
the Globus Toolkit [12], and (2) management of 
computation and harnessing of resources within a single 
administrative domain, specifically within the Condor 
system [20, 22]. In brief, we combine the inter-domain 
resource management protocols of the Globus Toolkit and 
the intra-domain resource management methods of 
Condor to allow the user to harness multi-domain 
resources as if they all belong to one personal domain. 
The user defines the tasks to be executed; Condor-G 
handles all aspects of discovering and acquiring 
appropriate resources, regardless of their location; 
initiating, monitoring, and managing execution on those 
resources; detecting and responding to failure; and 
notifying the user of termination. The result is a powerful 
tool for managing a variety of parallel computations in 
Grid environments. 

Condor-G's utility has been demonstrated via record- 
setting computations. For example, in one recent 
computation a Condor-G agent managed a mix of desktop 
workstations, commodity clusters, and supercomputer 
processors at ten sites to solve a previously open problem 
in numerical optimization. In this computation, over 
95,000 CPU hours were delivered over a period of less 
than seven days, with an average of 653 processors being 
active at any one time. In another case, resources at three 
sites were used to simulate and reconstruct 50,000 high- 
energy physics events, consuming 1,200 CPU hours in less 
than a day and a half. 

In the rest of this article, we describe the specific 
problem we seek to solve with Condor-G, the Condor-G 
architecture, and the results obtained to date. 

2. Large-scale sharing of computational 
resources 

We consider a Grid environment in which an 
individual user may, in principle, have access to 
computational resources at many sites. Answering why the 
user has access to these resources is not our concern. It 
may be because the user is a member of some scientific 
collaboration, or because the resources in question belong 
to a colleague, or because the user has entered into some 
contractual relationship with a resource provider [14]. The 
point is that the user is authorized to use resources at those 
sites to perform a computation. The question that we 
address is how to build and manage a multi-site 
computation that uses those resources. 

Performing a computation on resources that belong to 
different sites can be difficult in practice for the following 
reasons: 

• Different sites may feature different authentication 
and authorization mechanisms, schedulers, 
hardware architectures, operating systems, file 
systems, etc. 

• The user has little knowledge of the characteristics 
of resources at remote sites, and no easy means of 
obtaining this information. 

• Due to the distributed nature of the multi-site 
computing environment, computers, networks, and 
subcomputations can fail in various ways. 

• Keeping track of the status of different elements of 
a computation involves tedious bookkeeping, 
especially in the event of failure and dependencies 
among subcomputations. 

Furthermore, the user is typically not in a position to 
require uniform software systems on the remote sites. For 
example, if all sites to which a user had access ran DCE 
and DFS, with appropriate cross-realm Kerberos 
authentication arrangements, the task of creating a multi- 
site computation would be significantly easier. But it is 
not practical in the general case to assume such 
uniformity. 

The Condor-G system addresses these issues via a 
separation of concerns between the three problems of 
remote resource access, computation management, and 
remote execution environments: 
• Remote resource access issues are addressed by 
requiring that remote resources speak standard 
protocols for resource discovery and management. 
These protocols support secure discovery of remote 
resource configuration and state, and secure 
allocation of remote computational resources and 
management of computation on those resources. 
We use the protocols defined by the Globus Toolkit 
[12], a de facto standard for Grid computing. 

• Computation management issues are addressed via 
the introduction of a robust, multi-functional user 
computation management agent responsible for 
resource discovery, job submission, job 
management, and error recovery. This Condor-G 
component is taken from the Condor system [20]. 

• Remote execution environment issues are addressed 
via the use of mobile sandboxing technology that 
allows a user to create a tailored execution 
environment on a remote node. This Condor-G 
component is also taken from the Condor system. 

This separation of concerns between remote resource 
access and computation management has some significant 
benefits. First, it is significantly less demanding to require 
that a remote resource speak some simple protocols rather 
than to require it to support a more complex distributed 
computing environment. This is particularly important 
given that the deployment of production Grids [4, 18, 27] 
has made it increasingly common that remote resources 
speak these protocols. Second, as we explain below, 
careful design of remote access protocols can significantly 
simplify computation management. 

3. Grid protocol overview 

In this section, we briefly review the Grid protocols 
that we exploit in the Condor-G system: GRAM, GASS, 
MDS-2, and GSI. The Globus Toolkit provides open 
source implementations of each. 

3.1. Grid security infrastructure 

The Globus Toolkit's Grid Security Infrastructure 
(GSI) [13] provides essential building blocks for other 
Grid protocols and for Condor-G. This authentication and 
authorization system makes it possible to authenticate a 
user just once, using public key infrastructure (PKI) 
mechanisms to verify a user-supplied "Grid credential." 
GSI then handles the mapping of the Grid credential to the 
diverse local credentials and authentication/authorization 
mechanisms that apply at each site. Hence, users need not 
re-authenticate themselves each time they (or a program 
acting on their behalf, such as a Condor-G computation 
management service) access a new remote resource. 

GSI's PKI mechanisms require access to a private key 
that they use to sign requests. While in principle a user's 
private key could be cached for use by user programs, this 
approach exposes this critical resource to considerable 
risk. Instead, GSI employs the user's private key to create 
a proxy credential, which serves as a new private-public 
key pair that allows a proxy (such as the Condor-G agent) 
to make remote requests on behalf of the user. This proxy 
credential is analogous in many respects to a Kerberos 
ticket [26] or Andrew File System token. 

3.2. GRAM protocol and implementation 

The Grid Resource Allocation and Management 
(GRAM) protocol [10] supports remote submission of a 
computational request ("run program P") to a remote 
computational resource, and subsequent monitoring and 
control of the resulting computation. Three aspects of the 
protocol are particularly important for our purposes: 
security, two-phase commit, and fault tolerance. The latter 
two mechanisms were developed in collaboration with the 
UW team and are not yet part of the GRAM version 
included in the Globus Toolkit; they will be in the 
GRAM-2 protocol revision scheduled for later in 2001. 

GSI security mechanisms are used in all operations to 
authenticate the requestor and for authorization. 
Authentication is performed using the supplied proxy 
credential, hence providing for single sign-on. 
Authorization implements local policy and may involve 
mapping the user's "Grid id" into a local subject name; 
however, this mapping is transparent to the user. Work in 
progress will also allow authorization decisions to be 
made on the basis of capabilities supplied with the 
request. 

Two-phase commit is important as a means of 
achieving "exactly once" execution semantics. Each 
request from a client is accompanied by a unique 
sequence number, which is also included in the associated 
response. If no response is received after a certain amount 
of time, the client can repeat the request. The repeated 
sequence number allows the resource to distinguish 
between a lost request and a lost response. Once the client 
has received a response, it then sends a "commit" message 
to signal that job execution can commence. 
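
The resource side of this exchange can be sketched as follows. The class, method, and job-id naming are assumptions made for illustration and do not reproduce the GRAM message format; the sketch shows only how a repeated sequence number lets the resource return the earlier answer instead of creating a second job, and how execution waits for the commit.

```python
from typing import Dict, List


class Resource:
    """Resource side of a submit/commit exchange keyed by sequence number."""

    def __init__(self) -> None:
        self._job_for_seq: Dict[int, str] = {}  # seq -> job id already issued
        self._committed: Dict[int, bool] = {}   # seq -> has commit been seen
        self.programs: Dict[str, str] = {}      # job id -> requested program
        self.started: List[str] = []            # jobs whose execution began

    def submit(self, seq: int, program: str) -> str:
        # A repeated sequence number means the earlier response was lost:
        # return the same job id rather than queueing a duplicate job.
        if seq in self._job_for_seq:
            return self._job_for_seq[seq]
        job_id = f"job-{seq}"  # hypothetical job-id scheme
        self._job_for_seq[seq] = job_id
        self._committed[seq] = False
        self.programs[job_id] = program
        return job_id

    def commit(self, seq: int) -> None:
        # Start the job only on the first commit for a known request.
        if seq in self._job_for_seq and not self._committed[seq]:
            self._committed[seq] = True
            self.started.append(self._job_for_seq[seq])
```

If the request itself was lost, the retry is simply the first submission the resource sees; if only the response was lost, the retry is recognized by its sequence number. In both cases exactly one job is started.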

Resource-side fault tolerance support addresses the fact 
that a single "resource" may often contain multiple 
processors (e.g., a cluster or Condor pool) with 
specialized "interface" machines running the GRAM 
server(s) that maintain the mapping from submitting client 
to local process. Consequently, failure of an interface 
machine may result in the remote client losing contact 
with what is otherwise a correctly queued or executing 
job. Hence, our GRAM implementation logs details of all 
active jobs to stable storage at the client side, allowing 
this information to be retrieved if a GRAM server crashes 
and is restarted. This information can include details of 
how much standard output and error data has been 
received, thus permitting a client to request resending of 
this data after a crash of client or server. 
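
Logging job details to stable storage can be sketched with an append-only file of JSON lines. The file format, field names, and class name are assumptions of this example rather than the actual GRAM or Condor-G log layout; the sketch shows only how the latest recorded state of each job, including how much output has been received, can be rebuilt after a restart.

```python
import json
import os
from typing import Dict


class JobLog:
    """Append-only job log that can be replayed after a crash."""

    def __init__(self, path: str) -> None:
        self.path = path

    def record(self, job_id: str, state: str, stdout_bytes: int) -> None:
        entry = {"job": job_id, "state": state, "stdout_bytes": stdout_bytes}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the entry to disk before continuing

    def recover(self) -> Dict[str, dict]:
        """Replay the log; later entries for a job override earlier ones."""
        latest: Dict[str, dict] = {}
        if not os.path.exists(self.path):
            return latest
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                latest[entry["job"]] = entry
        return latest
```

After a restart, the recovered `stdout_bytes` value tells the client from which offset to request that output be resent.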

3.3. MDS protocols and implementation 

The Globus Toolkit's MDS-2 provides basic 
mechanisms for discovering and disseminating 
information about the structure and state of Grid resources 
[9]. The basic ideas are simple. A resource uses the Grid 
Resource Registration Protocol (GRRP) to notify other 
entities that it is part of the Grid. Those entities can then 
use the Grid Resource Information Protocol (GRIP) to 
obtain information about resource status. These two 
protocols allow us to construct a range of interesting 
structures, including various types of directories that 
support discovery of interesting resources. GSI 
authentication is used as a basis for access control. 

3.4. GASS 

The Globus Toolkit's Global Access to Secondary 
Storage (GASS) service [7] provides mechanisms for 
transferring data to and from a remote HTTP, FTP, or GASS
server. In the current context, we use these mechanisms to 
stage executables and input files to a remote computer. As 
usual, GSI mechanisms are used for authentication. 

4. Computation management: the Condor-G 
agent 

Next, we describe the Condor-G computation 
management service (or Condor-G agent).

4.1. User interface 

The Condor-G agent allows the user to treat the Grid as 
an entirely local resource, with an API and command line
tools that allow the user to perform the following job
management operations:

• submit jobs, indicating an executable name, 
input/output files and arguments;

• query a job's status, or cancel the job;

• be informed of job termination or problems, via 
callbacks or asynchronous mechanisms such as 
email; 

• obtain access to detailed logs, providing a complete 
history of their jobs' execution. 

There is nothing new or special about the semantics of
these capabilities, as one of the main objectives of
Condor-G is to preserve the look and feel of a local
resource manager. The innovation in Condor-G is that
these capabilities are provided by a personal desktop
agent and supported in a Grid environment, while
guaranteeing fault tolerance and exactly-once execution
semantics. By providing the user with a familiar and
reliable single access point to all the resources he/she is
authorized to use, Condor-G empowers end-users to
improve the productivity of their computations by
providing a unified view of dispersed resources.

Figure 1. Remote execution by Condor-G on Globus-managed resources

4.2. Supporting remote execution 

Behind the scenes, the Condor-G agent executes user 
computations on remote resources on the user's behalf. It
does this by using the Grid protocols described above to 
interact with machines on the Grid and mechanisms 
provided by Condor to maintain a persistent view of the 
state of the computation. In particular, it: 

• stages a job's standard I/O and executable using
GASS,

• submits a job to a remote machine using the revised
GRAM job request protocol, and

• subsequently monitors job status and recovers from
remote failures using the revised GRAM protocol
and GRAM callbacks and status calls, while

• authenticating all requests via GSI mechanisms.

The Condor-G agent also handles resubmission of 
failed jobs, communications with the user concerning
unusual and erroneous conditions (e.g., credential expiry, 
discussed below), and the recording of computation on 
stable storage to support restart in the event of its failure. 

We have structured the Condor-G agent 
implementation as depicted in Figure 1. The Scheduler 
responds to a user request to submit jobs destined to run 
on Grid resources by creating a new GridManager 
daemon to submit and manage those jobs. One 
GridManager process handles all jobs for a single user 
and terminates once all jobs are complete. Each 
GridManager job submission request (via the modified 
two-phase commit GRAM protocol) results in the creation 
of one Globus JobManager daemon. This daemon 
connects to the GridManager using GASS in order to 
transfer the job's executable and standard input files, and 
subsequently to provide real-time streaming of standard 
output and error. Next, the JobManager submits the jobs 
to the execution site's local scheduling system. Updates
on job status are sent by the JobManager back to the
GridManager, which then updates the Scheduler, where 
the job status is stored persistently as we describe below. 
When the job is started, a process environment variable 
points to a file containing the address/port (URL) of the 
listening GASS server in the GridManager process. If the 
address of the GASS server should change, perhaps 
because the submission machine was restarted, the 
GridManager requests the JobManager to update the file 
with the new address. This allows the job to continue file 
I/O after a crash recovery. 

Condor-G is built to tolerate four types of failure: crash 
of the Globus JobManager, crash of the machine that 
manages the remote resource (the machine that hosts the 
GateKeeper and JobManager), crash of the machine on 
which the GridManager is executing (or crash of the
GridManager alone), and failures in the network 
connecting the two machines. 

The GridManager detects remote failures by 
periodically probing the JobManagers of all the jobs it 
manages. If a JobManager fails to respond, the 
GridManager then probes the GateKeeper for that 
machine. If the GateKeeper responds, then the 
GridManager knows that the individual JobManager 
crashed. Otherwise, either the whole resource
management machine crashed or there is a network failure
(the GridManager cannot distinguish these two cases). If
only the JobManager crashed, the GridManager attempts
to start a new JobManager to resume watching the job.
Otherwise, the GridManager waits until it can reestablish
contact with the remote machine. When it does, it attempts
to reconnect to the JobManager. This can fail for two
reasons: the JobManager crashed (because the whole
machine crashed), or the JobManager exited normally
(because the job completed during a network failure). In
either case, the GridManager starts a new JobManager, 
which will resume watching the job or tell the 
GridManager that the job has completed. 
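
The probing logic in this paragraph amounts to a small decision table. The following sketch is illustrative only: the Action names and the diagnose function are assumptions introduced here, not part of the Condor-G code base, and the two booleans stand for the results of the JobManager and GateKeeper probes.

```python
from enum import Enum


class Action(Enum):
    NONE = "JobManager is healthy"
    RESTART_JOBMANAGER = "start a new JobManager"
    WAIT_FOR_CONTACT = "wait for contact, then reconnect or start a new JobManager"


def diagnose(jobmanager_alive, gatekeeper_alive):
    """Map the two probe results to the recovery action described in the text."""
    if jobmanager_alive:
        return Action.NONE
    if gatekeeper_alive:
        # The GateKeeper answers, so only the individual JobManager crashed.
        return Action.RESTART_JOBMANAGER
    # A machine crash and a network failure look the same from the submit side.
    return Action.WAIT_FOR_CONTACT
```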

To protect against local failure, all relevant state for 
each submitted job is stored persistently in the scheduler's 
job queue. This persistent information allows the 
GridManager to recover from a local crash. When 
restarted, the GridManager reads the information and 
reconnects to any of the JobManagers that were running at 
the time of the crash. If a JobManager fails to respond, the 
GridManager starts a new JobManager to watch that job. 
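
As a rough illustration of this recovery path, the sketch below persists a job table and rebuilds a recovery plan from it after a restart. The JSON file format, the function names, and the probe callback are all assumptions made for the example; the actual scheduler job queue uses its own storage format.

```python
import json
import os


def save_queue(path, jobs):
    """Persist the job table (job id -> JobManager contact string) atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(jobs, f)
        f.flush()
        os.fsync(f.fileno())  # force the data to stable storage before renaming
    os.replace(tmp, path)     # readers see either the old file or the new one


def recover(path, probe):
    """After a restart, decide per job whether to reconnect or start anew.

    probe(contact) is a hypothetical callback that returns True if the
    JobManager at that contact address responds.
    """
    with open(path) as f:
        jobs = json.load(f)
    return {
        job_id: "reconnect" if probe(contact) else "start new JobManager"
        for job_id, contact in jobs.items()
    }
```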

4.3. Credential management

A GSI proxy credential used by the Condor-G agent to 
authenticate with remote resources on the user's behalf is 
given a finite lifetime so as to limit the negative 
consequences of its capture by an adversary. A long-lived 
Condor-G computation must be able to deal with
credential expiration. The Condor-G agent addresses this
requirement by periodically analyzing the credentials for
all users with currently queued jobs. (GSI provides query
functions that support this analysis.) If a user's credentials
have expired or are about to expire, the agent places the
job in a hold state in its queue and sends the user an
e-mail message explaining that their job cannot run again
until their credentials are refreshed by using a simple tool.
Condor-G also allows credential alarms to be set. For 
instance, it can be configured to e-mail a reminder when 
less than a specified time remains before a credential 
expires. 
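
One pass of such a periodic scan might look like the following sketch. The function name, the two thresholds (hold_margin and alarm_margin), and the data layout are assumptions chosen for illustration; the real agent obtains expiry times through GSI query functions rather than from a dictionary.

```python
from datetime import datetime, timedelta


def check_credentials(jobs, expiry_by_user, now,
                      hold_margin=timedelta(minutes=5),
                      alarm_margin=timedelta(hours=2)):
    """Return (job ids to hold, users to remind) for one periodic scan.

    jobs is a list of (job_id, user) pairs; expiry_by_user maps each user
    to the datetime at which that user's proxy credential expires.
    """
    hold, remind = [], set()
    for job_id, user in jobs:
        remaining = expiry_by_user[user] - now
        if remaining <= hold_margin:
            hold.append(job_id)   # expired or about to expire: hold the job
        elif remaining < alarm_margin:
            remind.add(user)      # credential alarm: send a reminder e-mail
    return hold, remind
```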

Credentials may have been forwarded to a remote 
location, in which case the remote credentials need to be
refreshed as well. At the start of a job, the Condor-G agent 
forwards the user's proxy certificate from the submission 
machine to the remote GRAM server. When an expired
proxy is refreshed, Condor-G not only needs to refresh the
certificate on the local (submit) side of the connection, but 
it also needs to re-forward the refreshed proxy to the 
remote GRAM server. 

To reduce user hassle in dealing with expired 
credentials, Condor-G could be enhanced to work with a 
system like MyProxy [23]. MyProxy lets a user store a
long-lived proxy credential (e.g. a week) on a secure 
server. Remote services acting on behalf of the user can 
then obtain short-lived proxies (e.g. 12 hours) from the 
server. Condor-G could use these short-lived proxies to 
authenticate with and forward to remote resources and 
refresh them automatically from the MyProxy server when
they expire. This limits the exposure of the long-lived
proxy (only the MyProxy server and Condor-G have
access to it). 

4.4. Resource discovery and scheduling 

We have not yet addressed the critical question of how 
the Condor-G agent determines where to execute user
jobs. A number of strategies are possible. 

A simple approach, which we used in the initial
Condor-G implementation, is to employ a user-supplied
list of GRAM servers. This approach is a good starting
point for further development.

A more sophisticated approach is to construct a
personal resource broker that runs as part of the Condor-G
agent and combines information about user
authorization, application requirements and resource
status (obtained from MDS) to build a list of candidate
resources. These resources will be queried to determine
their current status, and jobs will be submitted to
appropriate resources depending on the results of these
queries. Available resources can be ranked by user
preferences such as allocation cost and expected start or
completion time. One promising approach to constructing
such a resource broker is to use the Condor Matchmaking
framework [25] to implement the brokering algorithm.
Such an approach is described by Vazhkudai et al. [28].
They gather information from MDS servers about Grid
storage resources, format that information and user
storage requests into ClassAds, and then use the
Matchmaker to make brokering decisions. A similar
approach could be taken for computational resources for
use with Condor-G.
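
To make the brokering idea concrete, the sketch below filters candidate resources by authorization and requirements, then ranks them by cost and expected start time. It uses plain Python dictionaries rather than the ClassAd language, and every attribute name (authorized_users, free_cpus, cost, and so on) is an assumption made for this example.

```python
def match(job, resources):
    """Return the authorized, capable resources in the job's preferred order."""
    candidates = [
        r for r in resources
        if job["user"] in r["authorized_users"]   # user authorization
        and r["free_cpus"] >= job["cpus"]         # application requirements
        and r["memory_mb"] >= job["memory_mb"]
    ]
    # Rank by user preference: lower allocation cost first,
    # then earlier expected start time.
    return sorted(candidates, key=lambda r: (r["cost"], r["expected_start_s"]))
```

A broker built this way would submit to the first entry of the returned list and fall back to later entries if the submission fails.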

In the case of high throughput computations, a simple
but effective technique is to "flood" candidate resources
with requests to execute jobs. These can be the actual jobs
submitted by the user or Condor "GlideIns" as discussed
below. Monitoring of actual queuing and execution times 
allows for the tuning of where to submit subsequent jobs 
and to migrate queued jobs. 

5. GlideIn mechanism

The techniques described above allow a user to 
construct, submit, and monitor the execution of a task 
graph, with failures and credential expirations handled 
seamlessly and appropriately. The result is a powerful
management tool for Grid computations. However, we
still have not addressed issues relating to what happens
when a job executes on a remote platform where required
files are not available and local policy may not permit
access to local file systems. Local policy may also impose
restrictions on the running time of the job, which may
prove inadequate for the job to complete. These additional
system and site policy heterogeneities can represent
substantial barriers.

We address these concerns via what we call mobile 
sandboxing. In brief, we use the mechanisms described 
above to start on a remote computer not a user job, but a
daemon process that performs the following functions: 

• It uses standard Condor mechanisms to advertise its
availability to a Condor Collector process, which is
queried by the Scheduler to learn about available
resources. Condor-G uses standard Condor
mechanisms to match locally queued jobs with the
resources advertised by these daemons and to
remotely execute them on these resources [25].

• It runs each user task received in a "sandbox,"
using system call trapping technologies provided
by the Condor system [20] to redirect system calls
issued by the task back to the originating system. In
the process, this both increases portability and
protects the local system.

• It periodically checkpoints the job to another
location (e.g., the originating location or a local
checkpoint server) and migrates the job to another
location if requested to do so (for example, when a
resource is required for another purpose or the
remote allocation expires) [21].

These various functions are precisely those provided
by the daemon process that is run on any computer 
participating in a Condor pool. The difference is that in 
Condor-G, these daemon processes are started not by the 
user, but by using the GRAM remote job submission 
protocol. In effect, the Condor-G GlideIn mechanism uses
Grid protocols to dynamically create a personal Condor
pool out of Grid resources by "gliding-in" Condor
daemons to the remote resource. Daemons shut down
gracefully when their local allocation expires or when they
do not receive any jobs to execute after a (configurable)
amount of time, thus guarding against runaway daemons.
Our implementation of this "GlideIn" capability submits
an initial GlideIn executable (a portable shell script),
which in turn uses GSI-authenticated GridFTP to retrieve
the Condor executables from a central repository, hence
avoiding a need for individual users to store binaries for
all potential architectures on their local machines.
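
The shutdown rule that guards against runaway daemons can be stated in a few lines. This is only a sketch of the rule as described above; the function and parameter names are hypothetical, and times are plain seconds rather than whatever clock the Condor daemons actually use.

```python
def should_shut_down(now, started_at, last_job_at, allocation_s, idle_limit_s):
    """Decide whether a glided-in daemon should exit gracefully.

    All arguments are in seconds. allocation_s is the length of the local
    allocation; idle_limit_s is the configurable idle timeout.
    """
    allocation_expired = now - started_at >= allocation_s
    idle_too_long = now - last_job_at >= idle_limit_s
    return allocation_expired or idle_too_long
```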

Another advantage of using GlideIns is that they allow
the Condor-G agent to delay the binding of an application
to a resource until the instant when the remote resource
manager decides to allocate the resource(s) to the user. By
doing so, the agent minimizes queuing delays by
preventing a job from waiting at one remote resource
while another resource capable of serving the job is
available. By submitting GlideIns to all remote resources
capable of serving a job, Condor-G can guarantee optimal
queuing times to its users. One can view the GlideIn as an
empty shell script submitted to a queuing system that can
be populated once it is allocated the requested resources.

6. Experiences

Three very different examples illustrate the range and 
scale of application that we have already encountered for 
Condor-G technology. 

An early version of Condor-G was used by a team of 
four mathematicians from Argonne National Laboratory, 
Northwestern University, and University of Iowa to 
harness the power of over 2,500 CPUs at 10 sites (eight 
Condor pools, one cluster managed by PBS, and one
supercomputer managed by LSF) to solve a very large
optimization problem [3]. In less than a week the team
logged over 95,000 CPU hours to solve more than 540
billion Linear Assignment Problems controlled by a
sophisticated branch-and-bound algorithm. This
computation used an average of 653 CPUs during that 
week, with a maximum of 1007 in use at any one time. 
Each worker in this Master-Worker application was 
implemented as an independent Condor job that used 
Remote I/O services to communicate with the Master. 

A group at Caltech that is part of the CMS Energy 
Physics collaboration has been using Condor-G to 
perform large-scale distributed simulation and 
reconstruction of high-energy physics events. A two-node 
Directed Acyclic Graph (DAG) of jobs submitted to a
Condor-G agent at Caltech triggers 100 simulation jobs on
the Condor pool at the University of Wisconsin. Each of
these jobs generates 500 events. The execution of these
jobs is also controlled by a DAG that makes sure that
local disk buffers do not overflow and that all events
produced are transferred via GridFTP to a data repository 
at NCSA. Once all simulation jobs terminate and all data 
is shipped to the repository, the Condor-G agent at 
Caltech submits a subsequent reconstruction job to the 
PBS system that manages the reconstruction cluster at 
NCSA. 
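
The dependency structure of this workflow can be sketched with Python's standard graphlib module. This is a simplified illustration, not the DAG description used by the Caltech group: the job names are invented, and the data-transfer steps are folded into the simulation jobs.

```python
from graphlib import TopologicalSorter


def build_workflow(n_sim):
    """Map each job name to the set of jobs it depends on."""
    sims = {f"sim{i}" for i in range(n_sim)}
    deps = {name: set() for name in sims}  # simulation jobs have no prerequisites
    deps["reconstruct"] = sims             # runs only after every simulation job
    return deps


def run_order(deps):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

For the workflow in the text, build_workflow(100) yields an order in which the reconstruction job is always last.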

Condor-G has also been used in the GridGaussian
project at NCSA to prototype a portal for running
Gaussian98 jobs on Grid resources. This portal uses
GlideIns to optimize access to remote resources and
employs a shared Mass Storage System (MSS) to store
input and output data. Users of the portal have two
requirements for managing the output of their Gaussian
jobs. First, the output should be reliably stored at MSS
when the job completes. Second, the users should be able
to view the output as it is produced. These requirements
are addressed by a utility program called G-Cat that
monitors the output file and sends updates to MSS as
partial file chunks. G-Cat hides network performance
variations from Gaussian by using local scratch storage as
a buffer for Gaussian's output, rather than sending the
output directly over the network. Users can view the
output as it is received at MSS using a standard FTP client
or by running a script that retrieves the file chunks from
MSS and assembles them for viewing.
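
The chunking behavior attributed to G-Cat can be approximated as follows. This sketch is not G-Cat itself: the function names and the fixed chunk size are assumptions, and the transfer to MSS is left out so that only the "read what is new, split it into chunks, reassemble later" logic is shown.

```python
def new_chunks(path, offset, chunk_size=4):
    """Read the bytes appended to `path` since `offset`.

    Returns (list of chunks, new offset). A real tool would ship each
    chunk to mass storage; here the chunks are simply returned.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return chunks, offset + len(data)


def assemble(chunks):
    """Rebuild the output from the stored chunks, in order."""
    return b"".join(chunks)
```

Because the producer writes only to local scratch storage, slow network transfers delay the chunk uploads but never block the job itself.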

7. Related work

The management of batch jobs within a single 
distributed system or domain has been addressed by many 
research and commercial systems, notably Condor [20], 
DQS [17], LSF [29], LoadLeveler [16], and PBS [15]. 
Some of these systems were extended with restrictive and
ad hoc capabilities for routing jobs submitted in one
domain to a queue in a different domain. In all cases, both
domains must run the same resource management
software. With the exception of Condor, they all use a
resource allocation framework that is based on a
system-wide collection of queues, each representing a
different class of service.

Condor flocking [11] supports multi-domain
computation management by using multiple Condor flocks
to exchange load. The major difference between Condor
flocking and Condor-G is that Condor-G allows
inter-domain operation on remote resources that require
authentication, and uses standard protocols that provide
access to resources controlled by other resource
management systems, rather than the special-purpose
sharing mechanisms of Condor.

Recently, various research and commercial groups 
have developed software tools that support the harnessing 
of idle computers for specific computations, via the use of 
simple remote execution agents (workers) that, once 
installed on a computer, can download problems (or, in 
some cases, Java applications) from a central location and 
run them when local resources are available (e.g.,
SETI@home [19], Entropia, and Parabon). These tools
assume a homogeneous environment where all resource
management services are provided by their own system.
Furthermore, a single master (i.e., a single submission
point) controls the distribution of work amongst all
available worker agents. Application-level scheduling
techniques [5, 6] provide "personalized" policies for
acquiring and managing collections of heterogeneous
resources. These systems employ resource management
services provided by batch systems to make the resources
available to the application and to place elements of the
application on these resources. An application-level
scheduler for high-throughput scheduling that takes data
locality information into account in interesting ways has
been constructed [8]. Condor-G mechanisms complement
this work by addressing issues of uniform remote access,
failure, credential expiry, etc. Condor-G could potentially
be used as a backend for an application-level scheduling
system.

Nimrod [2] provides a user interface for describing
"parameter sweep" problems, with the resulting
independent jobs being submitted to a resource 
management system; Nimrod-G [1] generalizes Nimrod to 
use Globus mechanisms to support access to remote 
resources. Condor-G addresses issues of failure, credential 
expiry, and inter-job dependencies that are not addressed
by Nimrod or Nimrod-G.

Grid Computing Q&A with Benny Souder, Vice President, Distributed Database
Development, Database and Application Server Technologies 

General Q&As

Q: Please explain the basic concept of grid computing. 

At the highest level, the central idea of Grid computing is computing as a utility. By that,
we mean that you shouldn't care where your data resides, or what computer processes
your request. You should be able to request information or computation and have it
delivered, as much as you want, and whenever you want. This is analogous to the way
electric utilities work, in that you don't know where the generator is, or how the electric
grid is wired, you just ask for electricity, and you get it. The goal is to make computing a
utility, a commodity, and ubiquitous. Hence the name, The Grid.

This view of utility computing is, of course, a "client side" view. From the "server side,"
or behind the scenes, the Grid is about resource allocation, information sharing, and
high availability. Resource allocation to ensure that all those that need or request 
resources are getting what they need, that resources are not standing idle while requests 
are going unserviced. Information sharing to make sure that the information users and 
applications need is available where and when it is needed. High availability since all 
the data and computation must always be there, just like a utility company must always 
provide electric power. 

Q: How did the concept of grid computing begin? What industries primarily use 
it? 

The idea of grid computing began in the academic and research communities. One of the 
earliest organizations active in Grids was CERN, which is also where the Web got its 
start. In some ways, the Web and the Grid have interesting parallels. Both are
disruptive technologies, both began in research institutions, both spread to commercial 
enterprises. And, as they spread to commercial enterprises, both evolve in ways that are 
different than originally intended. 

The first adopters of grid computing have been the financial, energy, and scientific 
industries. These are the commercial sectors that often adopt new technologies early. 
But more and more companies are starting to understand that grid computing has 
benefits for them, regardless of their industry. We're seeing more and more activity
around the Grid in all industries, and we expect that to continue. 

Q: How is Oracle involved in grid computing? 

Oracle has been involved in grid computing for years, as both an end-user and a vendor. 
We think that makes us unique among the major software vendors. 

As a Grid user, Oracle uses a grid to develop its database product. Oracle's grid enables
us to build the database faster, and with higher quality. It allows us to allocate
resources to specific development projects when we need to. It gives us much more
computing power than any other alternative computing investment would give us. We
believe that using a Grid gives us competitive advantages in our industry: quality,
productivity, and time to market. And, our use of Grids gives us insight into the problems
our users will face adopting Grid computing, and helps us understand how to make our
customers successful with Grid computing.

As a Grid vendor, we think we can help customers gain the same kinds of benefits we've
gained from Grid computing. Oracle has a strong technology stack, available today,
which enterprises can leverage to reap grid computing benefits. Oracle has key
technology differentiators that make Oracle offerings for Grid computing unique. We
think that the Grid needs to be open, interoperable, and standards based. Therefore, we
are working with the Global Grid Forum to help develop Grid standards. We're excited
by Grid computing. We think Grid computing is the next big thing, and we think that it is
already starting to happen. We think that as customers come to understand the Grid, and
Oracle's capabilities, they'll become excited, too.

Q: What is the Global Grid Forum (GGF)? 

The Global Grid Forum is a standards body that is developing standards for Grid 
computing. It is comprised of a set of committees and working groups that focus on 
various aspects of Grid computing. The committees and working groups are composed 
of participants from academic, research, and, increasingly, commercial companies. 
Oracle is working with GGF to help develop Grid standards. 

A related organization is Globus. Globus is a project to develop open source Grid 
software. It predates GGF and unlike GGF it has full time staffing. 

Q: Why does Oracle believe grid computing is the next big thing after the Internet? 

The time is ripe for grid computing. There are a number of threads which, taken together, 
will make Grids unstoppable. 

• In today's enterprises, people are concerned about affordability. Enterprises are
looking at ways of reducing costs and increasing the efficiencies of their processes 
and systems. Grid computing offers exactly that. Grid computing increases the 
utilization of enterprise resources. Grid computing is a way to consolidate your 
hardware, eliminating islands of underutilized computers. Instead, you can create 
centralized pools of computing and allocate computing resources to the priorities of 
your organization. 

• In hardware, every vendor has announced or is delivering "blades." Computer
blades offer the lowest cost computing power, sometimes as much as 80% less than
SMP. These blades can easily be assembled into "blade farms," which are the most
effective and scalable form of commodity computing. And, these blade farms are now 
being fitted with interconnects, making them hardware clusters. As such, they form 
the most cost effective form of commodity clusters, which we believe is the future 
architecture of computing. 



• In software, Linux continues to grow faster than any other OS. Today, Linux cannot 
scale to large SMP. But since blades are 1-4 cpus, Linux runs well on them today. 
The economic advantage of blades over SMP will cause blades to dominate, and
since Linux already works well for blades, this will accelerate Linux growth. Finally, 
Linux has a price advantage, which becomes more important as the number of blades 
grows, again accelerating Linux adoption. So commodity clusters naturally go well 
with Linux, the commodity OS. 

• In both the software and hardware industries, one of the big buzzwords at the
moment is "virtualization." But nothing is more "virtual" than a utility. A lot of
vendors are trying to claim that their new strategy is "virtualization" or "utility
computing," which is exactly what Grid computing is all about. We think that these
people will soon understand this and embrace the Grid.

• In the technology industry, grid momentum is building. Some major vendors such as 
Oracle are offering grid-enabling technology. Others such as IBM are planning to 
offer grid-enabling technology. The Grid standards body, GGF, is in place and has 
support of all major technology vendors. 

• In IT organizations, Grid momentum is also building. Grid technologies promise 
increased utilization of existing hardware. Grids can let you allocate your resources 
to meet the needs of your business, instead of having islands of computing that are 
idle or overloaded. And, as existing hardware needs to be replaced, blades offer the 
lowest cost. The economics are so compelling that enterprises have already started 
leveraging blade servers for grid computing. 

In addition to these trends, there's another reason why we believe the Grid is the next big
thing. If you look at the Web, it is really about presentation of information over the 
Internet or your intranet. We think after presentation, the next logical step is processing. 
Processing information over the Internet or your intranet is exactly what the Grid is all 
about. So one way to think of it is that the Grid is the next phase of the Internet, after the 
Web. In 1997 it was hard to see everything the Web would become, but you could tell it 
was going to be big. That's the state of the Grid today.

Q: How will customers benefit from using Oracle's grids? 

Customers using Oracle's Grids will realize higher resource utilization and lower costs.
They will also benefit from Oracle's superior operational characteristics - portability,
availability, security and scalability. Oracle portability ensures you get the same
operational benefits on all platforms including Linux and commodity clusters. Only
Oracle can truly scale, provide high availability, and dynamically provision resources on
low cost commodity clusters. Oracle makes the Grid unbreakable - you cannot break an
Oracle grid and you cannot break into an Oracle grid.

Oracle also has key Grid technology differentiators - such as Oracle Real Application 
Clusters, Oracle Streams, and Oracle Transportable Tablespaces, among many others. 
Most importantly, Oracle has a proven record of providing software for leading
platforms and environments. Oracle Grid customers can feel confident their investment
in Oracle technology will be leveraged as the Grid evolves. 

Q: What type of Oracle customers would be interested in grid computing? 

Everyone wants to save money, control resource allocation, improve utilization, and get 
faster, better results. So every Oracle customer should be interested in Grid computing. 

The key to success in today's economic reality is affordability. Customers with multiple
databases on dedicated hardware can use Grid computing to consolidate their hardware
resources, and realize increased utilization and efficiencies. As customers replace their
current hardware with newer, lower cost commodity hardware, Grid computing will let
them capitalize on this newer hardware and gain even greater savings and efficiency.
Every Oracle customer can benefit from Grid computing. We'll make it easy for our
customers to move to the Grid. In the past, we've made it easy to adopt advances such as
Unix, Java, and the Internet. We'll do it again for the Grid.

Q: What would be the long-term benefits of using grids? 

Behind the scenes, on the server side, Grid computing changes the way enterprises look
at resources. In place of buying resources for individual applications, enterprises will
buy resources for all of an enterprise's needs and provision them to individual
applications on demand or based on policy. This leads to efficient utilization of
enterprise resources and also provides tremendous reduction in development,
deployment, and management costs. It allows management to decide how to deploy the 
computing resources of the organization, and quickly change that deployment as the 
needs of the organization change. 

For users, or clients, Grid computing means that you no longer have to understand how 
or where the computation is performed. You never have to know when you are allowed 
to do certain things, or ask certain questions. Imagine if you could only plug in your hair 
dryer after the big generator on the edge of town started up, or if you could only watch 
the television if you turned off the lights! When these kinds of things happen to our 
power grid, we think there is a big problem - it's national news - but today these kinds of
considerations are normal for computing users. The Grid means to change that. 

Q: Why has grid computing taken so long to catch on? 

It hasn't, really. If you look at the history of the Web you'll see the Grid is probably
catching on faster than the Web did. And, like the Web, certain key problems have to be
solved and enabling technologies have to be developed before the idea can really take
off. To realize the benefits of the Grid requires a sophisticated software infrastructure
and universal connectivity. The Internet provides universal connectivity. Now we're
solving the infrastructure problems.

Q: What does Oracle have to offer that will help promote or stimulate the growth 
of grid computing? 



That's a good question, because we want to promote Grid computing. We think that it is
a good idea, that it will help our customers, and that it is a great model for our products. 
We will promote and stimulate the growth of Grid computing in two ways. 

First, we want to make it easy for Oracle customers to move to Grid computing. We do 
that by educating them on what the Grid is, how it can help them, and how to get there. 
And, we make it easy to move existing applications built with Oracle to Grid computing. 
We think this is very important to our customers, but also to the Grid. The Oracle 
installed base is very large, and moving the installed base to Grid computing would 
increase the number of Grid computing users many, many times over. Now, of course, 
the entire universe of Oracle-based systems won't immediately move to Grid computing. 
But, we can help start the process and we think that's good. 

Second, Oracle can support Grid standards and make sure Oracle products have the 
capabilities needed for Grid computing. Oracle's involvement in the Grid helps drive 
solutions to data management and information sharing challenges - Oracle understands 
data management and information sharing as few other companies do. By making our 
products work well for Grid computing, we make it easy to be successful with this new 
model of computing. We're taking our leading software products, and tailoring them to 
Grid computing. That is not only an endorsement but also a very large contribution to 
Grid computing. 

Q: How much can customers expect to pay for Oracle grids? 
The short answer is there are no additional costs to move to Oracle grids. The same 
Oracle products you use today support Grid computing. In fact, if you've deployed 
applications using Oracle, they are most likely immediately portable to Grid computing, 
with little or no change. 

In the future, we will include additional Grid features in Oracle products that will be 
available as you upgrade to newer product versions. Our strategy will be to continue to 
offer integrated software, so you can install it and immediately gain benefits without 
integrating the pieces or buying services. Delivering an integrated enterprise software 
stack is our business model, not selling services and customizations, pieces and parts. 

And, Oracle is making its Grid SDK available now at no charge on the Oracle 
Technology Network, OTN. As future versions of the Globus Toolkit are built, we'll 
continue to support them and offer versions that are integrated with the Oracle stack. 

Q: Why do you keep saying, "Oracle resonates with the Grid"? 
Because I don't think we see the Grid as a fashion. I think if you look at all the ideas 
we've tried to communicate in the last decade, they all line up, they all make sense with 
Grid computing. We've talked about commodity computing, clusters and cluster 
databases for 10 years. We've talked about consolidation as long as I can remember. 
We're alone among the major software vendors in that we've stuck to our belief in 
outsourcing, which is exactly in line with utility computing. Our themes around the 
Internet, Unbreakable, and Linux are perfectly aligned with the Grid. 



Honestly, I can't think of a company that has consistently been on message with the 
themes that today we call Grid computing, and been on that message for the last 10 
years. So I don't think Oracle is going to approach the Grid as "this year's fashion". I 
think the Grid resonates deeply with the core values and beliefs of Oracle. 

Technology Q&As 

Q: What is the difference between data grids and computing grids? 
Compute grids and data grids are really the same thing. They utilize the same 
infrastructure. With a compute grid, the focus is on providing a computing resource. 
With data grids, the focus is on providing a data resource. In reality, very few problems 
are strictly compute or data. Most enterprises will require sharing both compute and 
data resources. We think that for commercial users, the distinctions are not important. 
In fact, we question how useful the distinction is to any Grid user. 

That said, there are a lot of different implementations using the term Grid. To make it 
easier to understand, we've developed a taxonomy of Grids... a classification scheme, if 
you will. 

There are three phases: Scavenging, Sharing, and Dedicating. Each phase has hallmarks 
that make it pretty easy to take a given implementation and classify it. Each phase offers 
some benefits, but succeeding phases offer more benefits. Customers enter the grid at 
any of the three phases, but gravitate to later phases to gain increasing benefits. Today, 
we've already progressed through the maturation cycle to where most companies enter at 
Sharing or Dedicating. Within a year, I expect most customers to enter directly at 
Dedicating. Dedicating is where we aim with our technology, since it is where we most 
differentiate and where customers find most value. I'm happy to spend time explaining 
this in more detail - and perhaps we should bring this information out in more detail; it 
helps people understand why all these "different" things have the same name. And it 
may help move users along to Dedicating, which is good for them and for us. Oracle 
employees, at least, seem to really like the taxonomy. The only downside is you can only 
tell the average listener so much, and time we spend explaining taxonomies is time we 
can't spend explaining why we are better. But, education is good, and having this 
taxonomy may come in useful. 

Q: How is information shared in Oracle Grids? 

Information can be shared in a variety of ways. For information that is not frequently 
accessed, it may be most efficient to access it remotely as required. Oracle provides 
distributed SQL features that can transparently query or update data in other Oracle 
databases, making the data appear to be local. For data stored in non-Oracle databases, 
our distributed SQL features work in conjunction with our gateways to make the data 
appear to be local, and appear to be in the Oracle database. We offer Transparent 
Gateways for a number of database systems and a free generic gateway that gives access 
to any ODBC-compliant database. 



It may be more efficient to move the data in bulk and access it locally, if the data is 
frequently accessed. Oracle Transportable Tablespaces allow Oracle data files to be 
unplugged from a database, moved or copied to another location, and then plugged into 
another database. Unplugging or plugging a data file involves only reading or loading a 
small amount of metadata. This makes Transportable Tablespaces a very fast 
mechanism for moving Oracle data. Transportable Tablespaces also supports 
simultaneous mounting of read-only tablespaces by two or more databases. 

Some data needs to be shared as it is created or changed, rather than occasionally 
shared in bulk. Oracle Streams can stream data between databases, nodes or blade 
farms in a Grid. It provides a unified framework for information sharing, combining 
message queuing, replication, events, data warehouse loading, notifications and 
publish/subscribe into a single technology. Oracle Streams can keep two or more copies 
in sync as updates are applied. Streams automatically captures database changes, 
propagates the changes to subscribing nodes, applies changes, and detects and resolves 
any conflicts. Streams can also be used directly by applications as a message queuing 
feature, enabling communications between applications in the Grid. 

Q: Could there be a "gridnet" - like an Internet? 

Perhaps. If this does come to pass, it likely will be as the Internet and intranets are 
used today. So, there would be some support for free public utility computing over the 
Internet, but commercial users will likely keep most of their Grid inside the firewall, as 
companies do today with intranets. 

Also, it is possible that using technology like VPN, service providers will offer 
outsourcing or hosting services across the Internet using Grid technologies. The same 
advantages Grid computing gives companies will be equally important to outsourcing 
suppliers - greater utilization will result in lower costs, which translate into lower rates 
for users, which could help to speed adoption of outsourcing. And, that's as it should be, 
since outsourcing is very much computing as a utility. You don't know where the 
computer is, and you don't care; you get the applications, information, and computing 
you need. 

Q: What are the benefits of using grids vs. using supercomputers? 
Grids and supercomputers are not exclusive. A supercomputer can be a resource on a 
Grid. In fact, the quest for maximum utilization of expensive supercomputers was an 
early motivation for the Grid. Grid technology unlocks the potential of alternatives to 
supercomputers such as farms of inexpensive commodity server blades. These will 
provide much higher compute capacities at a fraction of the cost of supercomputers. 
The trick is getting software that can efficiently utilize a blade farm. For databases, 
Oracle is clearly the superior choice for blade farms. 

Q: Are grids a good step or even a likely one for mainstream decision support? 
Some of the earliest uses of Grid computing were to analyze massive amounts of 
scientific data. So, from the start, Grid computing has had a goal of supporting analytic 
operations and large amounts of data. Grid technologies enable efficient resource 
utilization, and flexible resource allocation. That's just as good for DSS as it is for 
OLTP. Grid technology can make compute rich environments available to DSS 
applications, if that is your priority. But, perhaps more importantly, the right Grid 
technology can allow you to provision compute resources to DSS or OLTP as the needs 
of your business change. 

For example, suppose you are a large internet retailer. At Christmas, you want to 
allocate all your resources to your website, to maximize sales and minimize response 
times. But, after Christmas, your website is virtually idle. Everyone bought everything at 
Christmas. Now, you have mountains of purchase and clickstream data. You want to 
allocate all your resources to analyze this data. That way, you can improve your 
marketing. You can get better plans for next year. If you have two isolated SMPs, one 
for the website and one for the data warehouse, this reallocation of resources is difficult 
or impossible. But, using Grid technology, you can reallocate. And with Oracle, you can 
reallocate easily. 

Q: What does Oracle provide for its developers in the area of grid computing? Are 
there toolkits available? 

Well, we're making the first version of the Oracle Grid SDK available on OTN. The 
SDK makes it easy to use Oracle along with the Globus Toolkit. The Globus Toolkit 
is a set of useful components that can be used either independently or together to develop 
Grid applications. The first version of our SDK implements an open source, or reference 
version, of the APIs of the Globus Toolkit mapped to Oracle APIs. So, again, we've done 
the integration for you. Here are the details: 

• Globus Resource Allocation Manager (GRAM): GRAM provides resource allocation 
and process creation, monitoring, and management services. GRAM implementations 
map requests expressed in a Resource Specification Language (RSL) into commands 
understood by local schedulers and computers. We've mapped GRAM to Oracle. 
This enables jobs specified in Globus RSL to be mapped to Oracle scheduled jobs, 
and to Oracle stored procedures. Thus, if you use GRAM, we've already done the 
work to integrate that with Oracle. 

• Monitoring and Discovery Service (MDS): MDS is an extensible Grid information 
service that combines data discovery mechanisms with the Lightweight Directory 
Access Protocol (LDAP). MDS provides a uniform framework for providing and 
accessing system configuration and status information such as compute server 
configuration, network status, or the locations of replicated data. MDS provides a 
Grid Resource Information Service (GRIS) that can be used to get details of an 
individual resource on the Grid. We've integrated with GRIS, to expose Oracle 
database attributes and properties through LDAP. This lets you find Oracle 
databases on a Grid, and determine if that particular database has the information or 
content you need. 

Q: What specific Oracle products are used to support grids? 

It's important to understand the simplicity of the Oracle Grid strategy. Oracle is not 
building new products for Grid computing. Oracle is incorporating Grid capabilities 
into all of its present products. So, the Oracle products that support Grid computing 
now, and in the future, are the same Oracle products you know today: Oracle 9i, Oracle 
iAS, and the technology stack built on top of them. We're offering support for Globus 
protocols and Globus resource discovery in our Grid SDK, which is free. When you 
move to Grid computing, you won't have to learn a new enterprise software stack; it will 
be the same stack you know how to use now. 

Our goal is to allow everything built on the Oracle stack to move to Grid computing. We 
may not achieve this goal, but I think we will be able to allow the vast majority of 
Oracle-based applications to function in Grids. So, in a sense, every technology Oracle 
offers is relevant to Grids. 

So, if you are interested in getting started with Grid computing, we have the technology 
you need. But, if you want to wait, we'll make it easy for you to adopt Grid computing 
when you are ready, and protect your investment in Oracle. 

Q: Will Oracle incorporate its Web Services technology into its grid computing 
products? 

Web services technology will be useful for Grid computing. Web services offer a well- 
defined communication mechanism. We have a complete Web Services technology 
offering, and it easily supports Grid computing. But, it is important to realize that web 
services are just one of the interfaces with which Grid entities can communicate. 
Because Grid users come from diverse backgrounds and are familiar with diverse 
development environments, Oracle will provide a choice of development environment 
and communication models. 

Q: Will Oracle incorporate its clustering technology into its grid computing 
products? 

Yes, RAC will be a key differentiator in the Grid. It enables the use of lowest cost 
hardware. It lets databases dynamically add and release resources. This ability to add 
and drop resources is critical to improving utilization and efficiency, and thus to 
reducing costs and improving productivity. It makes databases on this lowest cost 
hardware highly available. And, it lets you run real applications on this hardware. 

Grid benefits 




Grid computing goes far beyond sheer computing power. Today's 
operating environments must be resilient, flexible and integrated as 
never before. Organizations around the world are experiencing 
substantial benefits by implementing grids in critical business 
processes to achieve both business and technology benefits. 

Business benefits 
Accelerate time to results: 

• can help improve productivity and collaboration 

• can help solve problems that were previously unsolvable 

Enable collaboration and promote operational flexibility: 

• bring together not only IT resources but also people 

• allow widely dispersed departments and businesses to create 
virtual organizations to share data and resources 

Efficiently scale to meet variable business demands: 

• create flexible, resilient operational infrastructures 

• address rapid fluctuations in customer demands/needs 

• instantaneously access compute and data resources to "sense and 
respond" to needs 

Increase productivity: 

• can help give end users uninhibited access to the computing, data 
and storage resources they need (when they need them) 

• can help equip employees to move easily through product design 
phases, research projects and more — faster than ever 

Leverage existing capital investments: 

• can help you improve optimal utilization of computing 
capabilities 

• can help you avoid common pitfalls of over-provisioning and 
incurring excess costs 

• can free IT organizations from the burden of administering 
disparate, non-integrated systems 

Technology benefits 
Infrastructure optimization: 

• consolidate workload management 

• provide capacity for high-demand applications 

• reduce cycle times 

Increase access to data and collaboration: 

• federate data and distribute it globally 

• support large multi-disciplinary collaboration 

• enable collaboration across organizations and among businesses 

Resilient, highly available infrastructure: 

• balance workloads 

• foster business continuity 

• enable recovery and failover 

Abstract 

We describe MW - a software framework that allows users to 
quickly and easily parallelize scientific computations using the 
master-worker paradigm on the computational grid. MW provides both a "top 
level" interface to application software and a "bottom level" interface 
to existing grid computing toolkits. Both interfaces are briefly 
described. We conclude with a case study, where the necessary Grid 
services are provided by the Condor high-throughput computing system, 
and the MW-enabled application code is used to solve a combinatorial 
optimization problem of unprecedented complexity. 

1 Introduction 



By its very definition, the Grid [11] is a powerful and complex computing 
environment. In order to help harness its power, a large number of different 
programming efforts are underway that seek to provide robust middleware 
services [10] [14] [17] [9] [3] [21]. For users hoping to parallelize a large, 
single, coordinated application over the Grid, the overhead required to learn 
and assemble these Grid-enabling software components could (at this stage 
of their implementation) be discouraging. Thus, to enable a larger community 
of users to build applications running in parallel on the Grid, higher-level 
programming frameworks leveraging existing Grid services software are 
needed. NetSolve [4] provides an API to access and schedule Grid resources 
in a seamless way but it is not suited for writing non-embarrassingly parallel 
codes. Everyware [22] is a heroic effort that shows that an application can 
draw computational power transparently from the Grid, but Everyware is 
not abstracted as a programming tool at this stage of its implementation. 
CARMI/Wodi [20] was a useful programming interface for developing 
master-worker based parallel applications to run on the Grid, but it was strongly 
tied to the Condor-PVM [19] software tool, limited to applications with fixed 
work cycles, and finally abandoned. 

Our abstract programming framework MW is a complete, easy to use 
tool whereby users can distribute large, diverse, scientific computations in 
a Grid computing environment. The focus is on parallel applications with 
weak synchronization and reasonably large grain size that can be fit into a 
master-worker paradigm without significant loss of efficiency. To parallelize 
such algorithms on Grid computing platforms, users must address issues such 
as fault tolerance, task scheduling, and interprocess communication. By 
handling some of these issues automatically and exposing others, MW provides 
an API for rapidly implementing Grid-enabled master-worker algorithms. 
MW also abstracts an Infrastructure Programming Interface (IPI) such that 
it can be ported to use various Grid software toolkits without any changes 
from the application developer. MW has been used in the MetaNEOS project 
[18] to implement efficient parallel numerical optimization algorithms with 
complex control structures. The marriage of efficient algorithms with Grid 
computational resources has allowed the solution of problems of 
record-breaking sizes [2] [15]. 

The paper is organized as follows. In Section 2, we introduce MW, and we 
describe the interfaces to both application software and Grid infrastructure 
software. Section 3 discusses additional features of MW that help developers 
build efficient and robust applications. Section 4 presents a case study where 
the Grid services are provided by Condor [17], and the application code is used 
to solve a combinatorial optimization problem of unprecedented complexity. 
Conclusions about this line of research are also given. 

2 MW 

MW is a software framework that allows a user to easily parallelize a 
master-worker application on Grid resources. MW is a set of C++ abstract classes 
providing interfaces to both application programmer and Grid-infrastructure 
programmer. To Grid-enable an application with MW, the application 
programmer must re-implement a small number of virtual functions. Likewise, 
to port the MW framework to a new Grid software toolkit, the Grid 
infrastructure programmer need only re-implement a small number of virtual 
functions. 

2.1 Infrastructure Interface 

To distribute a master-worker computation on the Grid, we at least require 
software that can perform 

• Communication - Portions of the computation and results must be 
passed between master and workers. 

• Resource Management - The state of the available computational 
resources on the Grid must be known. 

Our usage of the term resource management is a bit broader than most. 
In this context, resource management encompasses 

• Resource request and detection - Asking for and identifying available 
processors. 

• Infrastructure querying - Determining information about processors 
and the interconnections between them. 

• Fault-detection - Noticing when processors leave the computation. 

• Remote execution - Starting processes on remote machines when they 
become available. 

There are a number of tools being built that provide these basic services, 
as well as features necessary to other Grid applications (such as security 
and remote data access). The Infrastructure Programming Interface (IPI) 
abstracts the core communication and resource management requirements 
for master-worker applications into the MWRMComm class. To allow MW 
applications to interact with existing Grid-services software, a concrete 
instance of the abstract MWRMComm class is derived, where the functionality 
required by MWRMComm is provided by the services in the specific Grid 
software toolkit. 

2.1.1 Communication 

The sole communications functionality required by MWRMComm is that 
point-to-point messages can be sent between the master and the worker 
processes. As such, MWRMComm has the (pure) virtual functions: 

• pack( <type> array, int size ) 

• unpack( <type> array, int size ) 

• send( int to_whom, int message_tag ) 

• recv( int from_whom, int message_tag ) 

All messages must be buffered by the MWRMComm implementation, 
and the send() function should be implemented as a nonblocking call. These 
design criteria are due to the fact that processors may disappear during the 
course of the computation. Since the Grid is heterogeneous, the pack() 
and unpack() functions must account for different native data types. In 
MWRMComm, the recv() routine should be implemented as a blocking 
function call. 
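
The communication requirements above can be sketched as a small C++ abstract class. This is an illustration only, not MW's actual header: the names CommSketch, LoopbackComm, Message, and Network are hypothetical, the <type> placeholder is narrowed to int, the int return value of recv() is an assumption, and the in-process loopback transport exists only so that the example runs without a Grid toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <vector>

// Hypothetical stand-in for the communication half of MWRMComm.
// The paper's <type> placeholder is narrowed to int to keep the sketch short.
class CommSketch {
public:
    virtual ~CommSketch() = default;
    virtual void pack(const int *array, int size) = 0;     // append to the send buffer
    virtual void unpack(int *array, int size) = 0;         // read from the last received message
    virtual void send(int to_whom, int message_tag) = 0;   // must not block
    virtual int recv(int from_whom, int message_tag) = 0;  // blocking in a real transport
};

struct Message {
    int from;
    int tag;
    std::vector<int> payload;
};

// One inbox per process id, shared by every endpoint in this single-process demo.
using Network = std::map<int, std::deque<Message>>;

class LoopbackComm : public CommSketch {
public:
    LoopbackComm(int self, Network &net) : self_(self), net_(net) {}

    void pack(const int *array, int size) override {
        send_buf_.insert(send_buf_.end(), array, array + size);
    }

    void unpack(int *array, int size) override {
        assert(read_pos_ + static_cast<std::size_t>(size) <= recv_buf_.size());
        for (int i = 0; i < size; ++i) array[i] = recv_buf_[read_pos_++];
    }

    // Queues the buffered message and returns immediately, so a worker that
    // has vanished cannot stall the sender.
    void send(int to_whom, int message_tag) override {
        net_[to_whom].push_back(Message{self_, message_tag, send_buf_});
        send_buf_.clear();
    }

    // Returns the number of ints received, or -1 if nothing matches.
    // (A real implementation would block here instead of returning -1.)
    int recv(int from_whom, int message_tag) override {
        std::deque<Message> &inbox = net_[self_];
        for (auto it = inbox.begin(); it != inbox.end(); ++it) {
            if (it->from == from_whom && it->tag == message_tag) {
                recv_buf_ = it->payload;
                read_pos_ = 0;
                inbox.erase(it);
                return static_cast<int>(recv_buf_.size());
            }
        }
        return -1;
    }

private:
    int self_;
    Network &net_;
    std::vector<int> send_buf_;
    std::vector<int> recv_buf_;
    std::size_t read_pos_ = 0;
};

// Round trip: the master (id 0) sends three ints to a worker (id 1).
int demo_round_trip_sum() {
    Network net;
    LoopbackComm master(0, net), worker(1, net);
    const int work[3] = {4, 5, 6};
    master.pack(work, 3);
    master.send(1, 7);  // tag 7 is arbitrary
    int got[3] = {0, 0, 0};
    if (worker.recv(0, 7) != 3) return -1;
    worker.unpack(got, 3);
    return got[0] + got[1] + got[2];
}
```

Calling demo_round_trip_sum() exercises the pack, send, recv, unpack sequence in order. A transport for a heterogeneous Grid would additionally convert packed values to a portable wire format, which this sketch leaves out.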

2.1.2 Resource management 

The application programmer may make a resource request by calling the 
function MWRMComm::set_target_num_workers( int num_workers ). 
It is up to the MWRMComm implementation to make appropriate resource 
requests in an attempt to garner this number of workers for the master-worker 
application, and also to make new requests if participating workers 
leave the computation. 

An important design decision for MW is that both communication and 
resource management functionality are included in a common class. The reason 
behind this decision is that MW requires that all information about the state 
of the computational resources be passed to it in the form of messages with 
specific tags such as HOSTADD and HOSTDELETE. Thus, an implementation of 
the (blocking) MWRMComm::recv() function on the master process should not 
only test for incoming messages from workers, but also check for changes to 
the state of the existing computational resources and report these changes 
as messages. 

When a HOSTADD message is received, the MWRMComm specification 
requires that the function call MWRMComm::start_worker( MWWorkerID *w ) 
will (attempt to) start a remote process on the machine that has been added, 
and will assign a unique process identifier in the MWWorkerID. When a 
HOSTDELETE message is received, MWRMComm requires that the unique 
process identifier be packed in the message buffer. 
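
One way to read this design is that the master's receive call merges two sources, worker traffic and resource-state changes, into a single stream of tagged messages. The sketch below illustrates that idea under stated assumptions: MasterInbox, Incoming, and the numeric tag values are hypothetical, both sources are plain in-memory queues, and recv() returns false where a real implementation would block.

```cpp
#include <cassert>
#include <deque>
#include <set>

// Tag names follow the text; the numeric values are assumptions.
enum MessageTag { HOSTADD = 1, HOSTDELETE = 2, RESULTS = 3 };

struct Incoming {
    MessageTag tag;
    int worker_id;  // unique process identifier carried in the message
};

// Hypothetical master-side inbox. A real recv() would wait on the network and
// poll the resource manager; here both sources are in-memory queues.
class MasterInbox {
public:
    void resource_added(int id) {
        assert(id >= 0);  // identifiers are assumed non-negative in this sketch
        resource_events_.push_back({HOSTADD, id});
    }
    void resource_removed(int id) { resource_events_.push_back({HOSTDELETE, id}); }
    void worker_result(int id)    { worker_messages_.push_back({RESULTS, id}); }

    // Resource-state changes are reported first, as ordinary tagged messages.
    // Returns false when nothing is pending (where a real recv() would block).
    bool recv(Incoming &out) {
        std::deque<Incoming> &q =
            !resource_events_.empty() ? resource_events_ : worker_messages_;
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }

private:
    std::deque<Incoming> resource_events_;
    std::deque<Incoming> worker_messages_;
};

// Master loop: keeps the set of live workers in step with the messages.
std::set<int> run_master(MasterInbox &inbox) {
    std::set<int> live;
    Incoming msg{RESULTS, -1};
    while (inbox.recv(msg)) {
        switch (msg.tag) {
        case HOSTADD:    live.insert(msg.worker_id); break;  // start_worker() would run here
        case HOSTDELETE: live.erase(msg.worker_id);  break;  // its task would be rescheduled
        case RESULTS:    break;                              // handled by the application
        }
    }
    return live;
}

// Two hosts join, one reports a result, one leaves: one live worker remains.
int demo_live_workers() {
    MasterInbox inbox;
    inbox.resource_added(10);
    inbox.resource_added(11);
    inbox.worker_result(10);
    inbox.resource_removed(11);
    return static_cast<int>(run_master(inbox).size());
}
```

Because every state change arrives through the same call as worker results, the master needs only one event loop, which is the practical benefit of placing both responsibilities in a common class.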

A final important function in the MWRMComm class is 
MWRMComm::get_worker_info( MWWorkerID *w ). This function uses 
underlying Grid services to populate the MWWorkerID class with "useful" 
information about the remote processor. Data members of the MWWorkerID 
class include the architecture, operating system, amount of memory, disk 
space, and speed of the remote machine. 

Clearly, this is not the entire specification of the MWRMComm class. 
Indeed, we consider the IPI that we have laid out in MW to be a work in 
progress. The interface will likely change, and additional functionality will 
be added as warranted. Due to the layered design of MW, application programs 
will be shielded from the interface changes. 

2.1.3 Example MWRMComm Implementations 

There are currently two implementations of the MWRMComm class. Both 
rely on the resource management facilities provided by the Condor 
high-throughput computing system [17]. As such, the MWDriver must deal with 
many processor faults, since the default Condor behavior is to vacate a 
running process when the "owner" of the machine returns. 

In one implementation, communication is done with PVM, and in the 
other, communication is done by using Condor's remote I/O mechanism [16] 
to write a series of shared files. Preliminary plans are being made for a 
port to the Globus software toolkit [10]. Table 1 highlights how the Grid 
service software provides (or could provide) the functionality required by 
MWRMComm. 

There are advantages and disadvantages to having MW act as an "upper 
middleware" layer between application code and Grid service software code. 
The additional software layer acts as a filter, hiding complexity of Grid service 
software, but also potentially hiding underlying functionality and knowledge 
of how the communication and resource management services are performed. 
A significant challenge is how to impart this functionality and knowledge to 
the application programmer, while still presenting a simple interface. MW 
errs on the side of simplicity, with the thought that additional Grid service 
functionality will be made available to the application programmer as needed. 

An advantage of the layered approach is that some advances in Grid services 
software can be leveraged by the application programmers to increase 
application performance. For our Condor-based MWRMComm implementations, 
two examples include flocking [8], where geographically distributed 
Condor pools are conceptually linked as one, and glide-in [7], where processors 
from an existing Globus resource can be added to a Condor pool on a 
temporary basis. These advanced Condor features are used by the application 
presented in Section 4. 

2.2 MW Application Programming Interface 

In a companion work [13], we argue that many scientific applications can be 
parallelized quite effectively for a Grid environment by using the master- 
worker paradigm. Our specific experience is with algorithms for solving 
numerical optimization problems and many of these algorithms share the 
following characteristics: 

• Incremental Data Requirement - A potentially large amount of data 
must be passed to worker processes at initialization, but thereafter, 
messages are small "incremental" changes to the initial data. 

• Weak Synchronization - The ability to execute a task does not depend 
on the completion of a large number of other tasks. 

Service        Condor with PVM            Condor with shared files   Globus (planned port)
-------------  -------------------------  -------------------------  -------------------------
Communication  Messages buffered and      Messages passed through    Messages passed and
               passed through PVM         shared worker files via    handled via Nexus
               pvm_pk*() in XDR format.   Condor Remote I/O.         nexus_send_rsr().

Resource       Requests formulated with   Requests formulated with   Requests in Globus RSL
Request and    Condor Class Ads, served   Condor Class Ads, served   handled and queued by
Detection      by Condor matchmaking,     by Condor matchmaking and  GRAM via
               and detection is notified  detected by checking       gram_client_job_request().
               by pvm_notify().           Condor logs.

Info           Information collected via  Information collected via  Information queried from
Querying       condor_status command.     condor_status command.     MDS via LDAP protocol.

Fault          Faults detected by         Faults detected by         Faults detected by HBM
Detection      Condor-PVM and passed      checking Condor logs.      local monitors are
               through pvm_notify().                                 collected by HBM data
                                                                     collector agent running
                                                                     on master.

Remote         Job started by             Job started by             Job started by GRAM when
Execution      pvm_spawn().               condor_startd daemon on    requests are served.
                                          remote resource.

Table 1: Summary of How Grid Services are Provided 



• Dynamic Grain Size - The computation can naturally be broken into 
portions of work of variable size. 

The MW API was designed to provide an interface that would be easy 
for application programmers to use, but also would allow these algorithmic 
characteristics to be exploited to build efficient master-worker applications. 

In order to parallelize an application with MW, the application 
programmer must re-implement three abstract base classes - MWDriver, MWTask, 
and MWWorker. 

2.2.1 MWDriver 

To create the MWDriver, the user need only implement four pure virtual 
functions: 

• get_userinfo( int argc, char *argv[] ) - Processes arguments and 
does basic setup. 

• setup_initial_tasks( int *n, MWTask ***tasks ) - Returns a set 
of tasks for the computation to begin work on. 

• pack_worker_init_data() - Packs the initial data to be sent to the 
worker upon startup. Use of this function allows the application to 
exploit an incremental data requirement. 

• act_on_completed_task( MWTask *task ) - Is called every time a task 
finishes. Some actions that the user could take include adding more 
tasks or making calculations based on the result of the task. 

By carefully deciding on actions to take in the act_on_completed_task() 
method, the user can take advantage of a weak synchronization inherent in 
the parallel application. 

The MWDriver manages a set of MWTasks and a set of MWWorkers to 
execute those tasks. The MWDriver base class handles workers joining and 
leaving the computation, assigns tasks to appropriate workers, and rematches 
running tasks when workers are lost. All this complexity is hidden from 
the application programmer. Further, the MWDriver offers more advanced 
functionality, as explained in Section 3. 

2.2.2 MWTask 

The MWTask is the abstraction of one unit of work. The class holds both the 
data describing that task and the results computed by the worker. By 
deciding on the size of the task, the application can use dynamic grain size to its 
advantage, easing contention at the master process, and increasing parallel 
efficiency. The derived task class must implement functions for sending and 
receiving its data between the master and worker. The names of these 
functions are self-explanatory: pack_work(), unpack_work(), pack_results(), 
and unpack_results(). These functions will call associated pack() and 
unpack() functions in the MWRMComm class. 

2.2.3 MWWorker 

The MWWorker class is the core of the worker executable. Two pure virtual 
functions must be implemented: 

• unpack_init_data() - Unpacks the initialization information passed in 
the MWDriver's pack_worker_init_data(). 

• execute_task( MWTask *task ) - Given a task, computes the results. 

After doing some basic initialization, the MWWorker sits in a simple 
loop. Given a task, it computes the results, reports the results back, and 
waits for another task. The loop finishes when the master asks the worker to 
end. It is an easy matter to bring in other libraries, such as highly optimized 
FORTRAN routines, to the worker. They can be linked with the C++ code 
and called by the execute_task() function. 
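
The division of labour between the three abstractions can be illustrated with a minimal single-process sketch. It is written in Python rather than the C++ of MW, the class names SquareTask, SquareWorker and SumOfSquaresDriver and the method run() are hypothetical, and JSON strings stand in for the buffers that the real pack/unpack functions would hand to MWRMComm. It is only meant to show which method is called when, not how MW is implemented.

```python
import json
from collections import deque


class SquareTask:
    """Stands in for a derived MWTask: holds the work and its result."""

    def __init__(self, x):
        self.x = x
        self.result = None

    def pack_work(self):
        return json.dumps({"x": self.x})

    @staticmethod
    def unpack_work(buf):
        return SquareTask(json.loads(buf)["x"])

    def pack_results(self):
        return json.dumps({"result": self.result})

    def unpack_results(self, buf):
        self.result = json.loads(buf)["result"]


class SquareWorker:
    """Stands in for a derived MWWorker."""

    def execute_task(self, task):
        task.result = task.x * task.x

    def serve(self, work_buf):
        # One pass of the worker loop: receive, compute, report back.
        task = SquareTask.unpack_work(work_buf)
        self.execute_task(task)
        return task.pack_results()


class SumOfSquaresDriver:
    """Stands in for a derived MWDriver."""

    def __init__(self, n):
        self.n = n
        self.total = 0

    def setup_initial_tasks(self):
        return [SquareTask(x) for x in range(1, self.n + 1)]

    def act_on_completed_task(self, task):
        self.total += task.result

    def run(self, workers):
        # Hypothetical main loop: hand tasks to workers in round-robin order.
        todo = deque(self.setup_initial_tasks())
        turn = 0
        while todo:
            task = todo.popleft()
            worker = workers[turn % len(workers)]
            turn += 1
            task.unpack_results(worker.serve(task.pack_work()))
            self.act_on_completed_task(task)
        return self.total


total = SumOfSquaresDriver(4).run([SquareWorker(), SquareWorker()])
print(total)  # 1 + 4 + 9 + 16 = 30
```

In this sketch the application code touches only the task contents and the two callbacks; everything about which worker receives which task stays inside run(), mirroring the way the MWDriver hides that complexity.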

3 Additional Functionality 

In addition to the necessary services provided by the MWDriver and the 
MWRMComm implementation, users of the MW framework benefit from a 
number of other useful features that are available through methods in the 
base MWDriver class. 
3.1 Checkpointing 

Because the MWDriver reschedules tasks when the processors running these 
tasks fail, applications running on top of MW are fault tolerant in the 
presence of all processor failures except for that of the master processor. 
In order to make computations fully reliable, MWDriver offers features to 
logically checkpoint the state of the computation on the master process at a 
user-defined frequency. To enable checkpointing, the user must implement 
functions for writing and reading the state contained in the application's 
master and task classes. Use of the master checkpoint facility is 
demonstrated in Section 4. 
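
A minimal sketch of what such user-supplied write and read functions might look like is given below, in Python rather than C++. The function names, the JSON file format and the write-then-rename step are assumptions made for illustration; the text above does not describe MW's actual checkpoint format.

```python
import json
import os
import tempfile


def write_checkpoint(path, master_state, pending_tasks):
    """Write the master's state and the unfinished tasks in one atomic step."""
    payload = {"master": master_state, "tasks": pending_tasks}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    # The previous checkpoint stays valid until the rename succeeds.
    os.replace(tmp, path)


def read_checkpoint(path):
    """Restore the master's state and the task list after a restart."""
    with open(path) as f:
        payload = json.load(f)
    return payload["master"], payload["tasks"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "master.ckpt")
write_checkpoint(ckpt, {"best_bound": 42}, [[1, 2], [3, 4]])
state, tasks = read_checkpoint(ckpt)
```

Writing to a temporary file first means a crash of the master during checkpointing cannot leave a half-written file in place of the last good checkpoint.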

3.2 Normalized Performance Measurement 

The heterogeneous and dynamic nature of the Grid makes application 
performance difficult to assess. Standard performance measures such as wall 
clock time and cumulative CPU time do not separate application code and 
computing platform performance. By normalizing the CPU time spent on a 
given task with the performance of the corresponding worker, the MWDriver 
aggregates time statistics that are comparable between runs. The normaliza- 
tion factor can be based on vendor information such as MIPS or KFLOPS, 
if this information is available from the underlying Grid service software. 
Alternatively, MW allows the user to register an application specific bench- 
mark task that is sent to all workers that join the computational pool. The 
speed at which the benchmark task is completed is used as the normalization 
factor. 

If we make the following definitions: 

• a(i) - Worker i performance normalization factor, 

• u(i) - Worker i uptime, 

• w(j) - Index of worker who solved task j, 

• t(j) - User time spent by worker w(j) at solving j, 

• W - Wall clock time, 

• T - Cumulative workers CPU time. 

We can then define the following statistics: 

• \bar{T} - Normalized cumulative time: 

  $\bar{T} = \sum_j a(w(j)) \, t(j)$ 

• V - Equivalent pool performance: 

  $V = \frac{\sum_i a(i) \, u(i)}{W}$ 

• N - Average number of workers: 

  $N = \frac{\sum_i u(i)}{W}$ 

• η - Parallel efficiency: 

  $\eta = \frac{\sum_j t(j)}{\sum_i u(i)}$ 

Table 2 shows the variations of performance statistics between runs 
of a Grid-enabled application (presented in Section 4). The same problem 
instance was solved eight times, each time on a different set of processors. A 
user-defined benchmark task was used to define the normalization factor. 





            Mean      Std. Dev.   Min       Max 
W           915       1019        489       1780 
T           22182     27900       8844      37671 
\bar{T}     5864      341         5739      6054 
V           7.27      7.16        3.2       12.4 
N           27.5      21.7        16        39 
η           0.87      0.07        0.84      0.92 

Table 2: Mean, Variance and Extreme Value on 8 different runs. 

As expected, the statistics show the large variance of W and T. However, 
there is little variance of \bar{T}, the normalized cumulative time, which 
can therefore be used to do comparisons between runs and assess the 
application performance. 
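
The statistics of this section can be computed from per-worker and per-task records as in the following sketch. It assumes the reading \bar{T} = sum_j a(w(j)) t(j), V = sum_i a(i) u(i) / W, N = sum_i u(i) / W and η = sum_j t(j) / sum_i u(i), which is consistent with the definitions and with the values in Table 2; the function name, argument layout and the sample numbers are hypothetical.

```python
def normalized_statistics(a, u, tasks, wall_clock):
    """a[i]: normalization factor of worker i; u[i]: uptime of worker i;
    tasks: list of (w, t) pairs giving the worker index and user time of a task."""
    total_uptime = sum(u.values())
    T = sum(t for _, t in tasks)                    # cumulative CPU time
    T_bar = sum(a[w] * t for w, t in tasks)         # normalized cumulative time
    V = sum(a[i] * u[i] for i in u) / wall_clock    # equivalent pool performance
    N = total_uptime / wall_clock                   # average number of workers
    eta = T / total_uptime                          # parallel efficiency
    return {"T": T, "T_bar": T_bar, "V": V, "N": N, "eta": eta}


# Two workers: worker 1 is half as fast as the reference machine.
a = {0: 1.0, 1: 0.5}
u = {0: 100.0, 1: 80.0}
tasks = [(0, 40.0), (0, 50.0), (1, 60.0)]
stats = normalized_statistics(a, u, tasks, wall_clock=100.0)
```

In this example the 60 seconds spent on the slow worker count as only 30 normalized seconds, which is what makes \bar{T} comparable between runs on different machines.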
3.3 Task Scheduling 

Internally, the MWDriver manages a list of workers and a list of tasks. Task 
scheduling is accomplished by assigning the first task in the task list to the 
first idle worker in the worker list. In MWDriver, there is an interface to 
specify that the task list be ordered by a user-defined key, ensuring that 
"important" tasks are performed first. The worker list may be similarly 
ordered, so that "good" machines are the first to receive tasks. By default, the 
worker list is ordered using the machine KFLOPS information (if provided 
by the Grid software implementing MWRMComm), or by the benchmark 
factor if the user has registered an application-specific benchmark task. 

While this is a rudimentary scheduling algorithm, it has proven sufficient 
for all applications implemented to date with MW. The applications have 
had no need to match specific tasks with specific workers. Also, the 
applications are not data-intensive, so use of advanced services such as the 
Network Weather Service ?? to improve scheduling has not been warranted. 
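
The list-based scheduling rule described above can be sketched as follows. The function name, the dictionary fields and the two key functions are hypothetical; the sketch only shows the pairing of the ordered task list with the ordered list of idle workers.

```python
def schedule(tasks, workers, task_key, worker_speed):
    """Pair the first tasks in the ordered task list with the best idle workers."""
    ordered_tasks = sorted(tasks, key=task_key)
    idle = sorted((w for w in workers if w["idle"]), key=worker_speed, reverse=True)
    # zip stops at the shorter list: leftover tasks wait for the next idle worker.
    return [(t["name"], w["name"]) for t, w in zip(ordered_tasks, idle)]


tasks = [
    {"name": "t1", "bound": 5.0},
    {"name": "t2", "bound": 2.0},
    {"name": "t3", "bound": 9.0},
]
workers = [
    {"name": "slow", "idle": True, "benchmark": 1.0},
    {"name": "busy", "idle": False, "benchmark": 9.0},
    {"name": "fast", "idle": True, "benchmark": 3.0},
]
assignment = schedule(
    tasks,
    workers,
    task_key=lambda t: t["bound"],            # user-defined importance key
    worker_speed=lambda w: w["benchmark"],    # benchmark factor or KFLOPS
)
```

Here the task with the smallest key goes to the fastest idle worker, and the third task stays in the list until a worker becomes idle.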


4 Application to Combinatorial Optimization 

MW has been used in the MetaNEOS project [18] to implement several 
grid-enabled parallel optimization solvers [6] [12] [15]. One solver has been 
specialized to solve the quadratic assignment problem (QAP) [5]. Despite its 
simple statement (minimize the assignment cost of n facilities to n 
locations), even modest-sized instances of the QAP are extremely difficult 
to solve. Problems with n > 20 are difficult; problems with n > 30 have not 
even been attempted yet. By embedding a new relaxation technique [1] into a 
branch-and-bound framework, and implementing the resulting solver within MW, 
we managed to solve what is regarded by experts in the field as the most 
difficult QAP instance to provable optimality [2]. 

In order to use the computational resources with maximal efficiency, the 
parallelization strategy of the branch-and-bound tree search has been 
carefully designed. Issues such as the proper ordering of the task list and 
the selection of the grain size were carefully considered in order to 
minimize communication overhead and contention at the master process without 
introducing large parallel search anomalies. By using the intuitive MW API, 
implementing the parallel version of the sequential branch-and-bound code 
was extremely simple and fast. 
The MW-ized QAP application code was compiled to use the Condor/File-Based 
MWRMComm implementation. Our computational pool was composed of 

• University of Wisconsin Condor pool workstations, 

• a University of Wisconsin dedicated Linux cluster, 

• a University of New Mexico "flocked" Condor pool, 

• the National Institute for Nuclear Physics, Bologna, Italy, "flocked" 
Condor pool, and 

• the Argonne National Laboratory SGI Origin2000 acquired via Globus 
through the glide-in mechanism. 

The pool is depicted in Figure 1. 




[Figure 1 diagram: Linux Cluster (Wisconsin), Condor Pool (Wisconsin), 
Condor Pool (New Mexico), SGI/O2K (Argonne), Condor Pool (Bologna, Italy)] 



Figure 1: The Computational Pool. 

Figure 2 depicts the evolution of the number of machines of each type 
during our run. At 11:30 AM a glide-in request was made for 32 SGI 
processors on Argonne's O2K for a period of 12 hours. At 6:30 PM, the Condor 
scheduling daemon was reconfigured to allow flocking with the INFN Con- 
dor pool in Bologna, Italy. The job was stopped manually at 11PM, and 
we restarted it at 8AM from the master's checkpoint file, as explained in 
Section 3.1. 

Over the course of the computation, an average of 211.3 machines were in 
use, with a peak of 285. The parallel efficiency obtained during the run was 
η = 0.83. The average performance of the computational pool was 195 times 
the performance of one of the dedicated Linux nodes. Neglecting parallel 
search anomalies, solving this problem with the sequential algorithm on a 
dedicated Linux node would have required roughly 177 days of computation or 
more. The marriage of Grid resources with the advanced algorithm allowed the 
solution of a heretofore unsolved problem. 
Figure 2: Number of Workers 



5 Conclusions and Future Work 

MW has allowed algorithm developers to bring together a large number of 
heterogeneous, geographically dispersed resources to solve extremely large 
problems. The simple API of MW provided a convenient programming model 
enabling the user to focus on algorithmic features without worrying about the 
details of setting up computations, and the IPI has improved the portability 
of the resulting code to different grid computing environments. 
It is the continued goal of this work to draw further application 
developers by providing a simple interface, access to Grid resources, and 
useful functionality at no expense to the application code. We also wish to 
entice Grid infrastructure developers to support MW by providing a simple, 
well-defined interface, and interesting and useful applications. There is 
still work to be done to turn these goals into realities. 

Further information about MW is available from 

http://www.cs.wisc.edu/condor/mw. 
Abstract 

Selection of the most suitable nodes on a network to ex- 
ecute a parallel application requires matching the network 
status to the application requirements. We propose and val- 
idate a novel two step approach that exploits the knowledge 
of the communication structure of the application to address 
this problem. In the first step, a small set of candidate node 
groups are selected as potential sites of application execu- 
tion, by analyzing the network status information and the 
communication patterns used by the application. The second step is based on 
the concept of a communication skeleton, which is a short running program 
that generates the dominant communication operations of the application it 
represents. The communication skeleton is executed on all candidate groups 
of nodes. The node group selected for application execution is the one that 
achieves the best performance on the communication skeleton. This approach 
leads 
to customized node selection and is particularly well suited 
to situations where available network information is of poor 
quality or expected communication performance cannot be 
modeled accurately. We motivate this approach, describe 
a prototype implementation, and present performance re- 
sults for NAS Parallel Benchmarks executing on a shared 
network testbed. 



1. Introduction 

Selection of computation nodes to execute a parallel application is a 
central problem for computing on shared clusters and computation grids 
[7, 8]. Node selection based on CPU considerations has been addressed by 
several systems, some well-known examples being Condor [9] and LSF [20]. 
The problem of node selection is significantly more complex when the 
communication needs of applications must also be taken into account. The 
main reasons for the additional complexity are that the communication 
properties cannot be associated with individual nodes, network status 
changes dynamically, and the availability of network resources is difficult 
to measure and predict accurately. The state of the art in resource 
selection with communication considerations can be paraphrased as follows. 
The status of the network is measured and predicted with tools such as 
Network Weather Service [19] and Remos [10, 11], and this information is 
analyzed to identify a good group of available computation nodes and 
network paths for application execution. Some research projects have 
focused on getting the best general group of execution nodes [3, 14, 18] 
while others have developed procedures customized for a particular 
application or application class [4, 5, 6, 12]. 

This approach to node selection has the fundamental 
drawback that the decisions made are, at best, only as good 
as the accuracy with which the network status was mea- 
sured and future network performance predicted. The per- 
formance that a network delivers to an application can vary 
significantly from the performance predicted by network 
measurement tools for a variety of reasons, some of which 
are as follows: 

• Network status and prediction information may be outdated. Measurement of 
network properties, such as available bandwidth, can be intrusive and 
expensive, and the cost rises rapidly with the size of the network. Hence, 
it may be practical only to perform measurements relatively infrequently 
while the network state changes continuously. 

• Network tools typically measure the unused network capacity or the 
network bandwidth achieved by a specific measurement probe. However, the 
relationship between available network capacity and the performance 
achieved by communication operations in an application is complex, and 
depends on other factors also, such as the network transport protocols in 
use. For example, the bandwidth that a TCP stream achieves on a busy network 
route partly depends on the number of other TCP streams using the links in 
the route. 



• The performance of collective communication operations, common in 
parallel applications, is very difficult to estimate on a shared cluster or 
a grid environment. We are not aware of any tools that have proven their 
effectiveness in this respect. In particular, interference between multiple 
application communication streams sharing the same network path is very 
difficult to model. 



The point is that the expected performance of an appli- 
cation's communication operations inferred from network 
measurement tools can be significantly different from the 
actual performance for several different reasons. This lim- 
its the effectiveness of any node selection procedure entirely 
based on network measurements. 

This research pursues a new approach to node selection 
motivated by the above discussion. The centerpiece of our 
methodology is the concept of a performance skeleton of 
an application, which is defined as a synthetically generated short running 
program that has the same fundamental execution characteristics as the 
application it represents, but with no semantic relevance. The execution 
time of the performance skeleton program on a given set of nodes reflects 
the execution time of the application under the same conditions, but 
possibly scaled down by multiple orders of 
magnitude. In this approach, the performance of the perfor- 
mance skeleton on a group of nodes determines the likeli- 
hood of those nodes being chosen for execution of the cor- 
responding application since the performance of the skele- 
ton is closely related to the performance of the application. 
This methodology eliminates the impact of inherent inac- 
curacy in network measurement and modeling. A perfor- 
mance skeleton is constructed ahead of time and executed 
prior to application execution to drive node selection. 

This paper addresses only a part of the challenge of employing a 
performance skeleton based approach to node selection. The results presented 
are restricted to sharing of communication resources only. We assume that 
all available computation nodes have the same available computation 
capacity but communication properties of the network links connecting the 
nodes are varying. Hence, node selection is 
based on bandwidth considerations only. In this scenario, a 
performance skeleton needs to be faithful to the original ap- 
plication in terms of communication behavior only. Hence, 
in order to be more accurate, we will refer to them as com- 
munication skeletons in this paper. 

A communication skeleton is a short running program, 
and that is the key to keeping the run-time overhead of 
this approach acceptable. However, the number of possible groups of nodes 
that are candidates for application execution grows combinatorially with 
the total number of available nodes. Hence, it is not practical to execute 
even a short running communication skeleton on every candidate group of 
nodes. Therefore, we employ a separate procedure to select a set of 
candidate node groups from all available nodes. 
This algorithm is based on the information about the net- 
work status obtained from network measurement tools and 
the information about the communication pattern of an ap- 
plication, which is computed in a preprocessing phase. The 
final group of execution nodes is selected based on the execution time of 
the performance skeleton program on the 
candidate node groups. 

This paper is organized as follows. The node selection 
framework is described in section 2. Section 3 describes 
our prototype implementation and results from experiments 
to validate the node selection procedure. Section 4 explains 
the capabilities and limitations of our approach and implementation, and 
discusses ongoing and future work. Section 5 contains conclusions. 

2 Node selection framework 

We first outline the main steps and components of the 
node selection framework. 

The first two steps are performed ahead of time, once for 
each application. 

1. Identify the main communication patterns of the candidate application. 

2. Construct the communication skeleton of the application. 

The following subsequent steps are performed at the time the application 
has to be scheduled for execution. 

3. Obtain current network status information. 

4. Identify a small set of candidate node groups for exe- 
cution by employing a node selection algorithm based 
on the network status and application's communication 
pattern. 

5. Execute the communication skeleton program on each 
candidate group of nodes. Select the node group with 
the lowest execution time to schedule the application. 

We now discuss each of the above steps in more detail. 

2.1 Identification of communication pattern 

The node selection framework relies heavily on the 
knowledge of the communication patterns in the applica- 
tion that has to be executed. These are captured by execut- 
ing the application in a preprocessing phase on a controlled 
testbed and monitoring the message traffic between nodes. 
The procedure has been discussed in detail in [13, 15] in a 
related context. The methodology completely relies on system monitoring on 
the testbed while the application is executing and does not require 
application knowledge or access 
to the source code. The communication structure of NAS 
benchmark programs inferred from such runtime measure- 
ments is illustrated in Figure 2. 

2.2 Communication skeleton program 

The communication skeleton of an application is a synthetically generated 
program that replicates the dominant communication patterns employed by the 
application. As stated earlier, the size and pattern of the messages 
exchanged by the nodes executing an application are inferred by monitoring 
execution on a controlled testbed. Automatic construction of skeletons from 
this information is an important long term goal of our research. However, 
for the results presented in this paper, manually constructed communication 
skeletons were employed. A program that performs a set of representative 
message exchanges along the communication routes used by the application 
qualifies as a communication skeleton of the application. 

2.3 Network status information 

Our network status measurement module employs Net- 
work Weather Service [19], a freely available distributed 
resource monitoring system. NWS gathers system level re- 
source information, such as CPU load and available band- 
width, for network connected compute nodes. We employ 
NWS to measure the available bandwidth between all nodes 
that can be used to execute an application. This step yields 
a graph with compute nodes as graph nodes and available 
bandwidth between them as graph edges. We will refer to 
this graph as the network status graph. 

2.4 Node selection algorithm 

The first step in the process of node selection is a pro- 
cedure that analyzes the network status graph to choose a 
set of "good" candidate groups of nodes for application ex- 
ecution. Another input to the node selection procedure is 
the application structure, basically the number of nodes re- 
quired to execute the application and pairs of nodes that 
communicate in the main data exchange patterns. The objective of this 
algorithm is to determine the group of nodes for which the minimum of the 
available bandwidth between communicating nodes is maximized. The reason for 
choosing this particular criterion is that the time for completion of 
a collective communication step in parallel programs is typ- 
ically determined by the lowest bandwidth communication 
path rather than the average available bandwidth on com- 
munication paths. 



This communication pattern based algorithm for node selection is presented 
in Figure 1. The algorithm is similar to the one that was introduced by 
Subhlok et al. in [14], but with one important difference. The algorithm in 
Figure 1 attempts to optimize performance over network paths that are 
utilized by an application, while the algorithm in [14] treated all network 
paths connecting executing nodes equally. For example, in the algorithm in 
Figure 1, if the main application communication pattern is an all to all 
data exchange between computing nodes, then the network path between each 
pair of nodes is optimized. However, if the main communication pattern takes 
the form of a one dimensional ring, then only the paths composing the ring 
are considered for optimization. 

We informally explain the node selection algorithm 
stated in Figure 1. Suppose the goal of the algorithm is 
to select m optimal nodes. The algorithm starts with the network status 
graph and repeatedly removes the edge with the minimum available bandwidth 
from the graph. At every step, the algorithm verifies that there are m nodes 
that are connected in a way that satisfies the communication pattern of the 
application (e.g., if the communication pattern is a ring, there must be a 
path from one node to another such that a ring can be completed). A path 
from one node to another can include network routers but not other 
computation nodes. When removing the minimum available bandwidth edge leads 
to a situation where m such nodes cannot be found, the algorithm stops. The 
last step is reversed and a set of m nodes is selected. 

The algorithm as presented in Figure 1 selects a single group of optimal 
nodes, but our framework is based on selection of a set of candidate node 
groups. In practice, the algorithm is easily modified for usage in our 
framework by 
backtracking the last few edge deletions and selecting all 
feasible node groups at that point. 

2.5 Final node selection with communication skeletons 

For final node selection, the communication skeleton program is executed on 
each group of candidate nodes selected by the node selection algorithm 
described above. The group of nodes on which the communication skeleton 
achieves the best performance is selected for application execution. An 
important consideration in this step is to not execute the communication 
skeleton concurrently on intersecting groups of nodes, since execution on 
one group of nodes is likely to impact performance on other groups. Note 
that the communication skeleton program is short running, typically a few 
seconds long, and hence this stage is not likely to make a significant 
impact on the turnaround time of an application. 
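
The final selection step, including the constraint on intersecting groups, can be sketched as below. The greedy grouping into rounds is one possible way to honor the constraint and is an assumption of this sketch, not a procedure described in the paper; the function names and the measured times are hypothetical, and run_skeleton stands for whatever mechanism launches the skeleton and returns its execution time.

```python
def disjoint_rounds(groups):
    """Greedily split candidate groups into rounds of pairwise-disjoint groups."""
    rounds = []
    for group in groups:
        members = set(group)
        for rnd in rounds:
            if all(members.isdisjoint(other) for other in rnd):
                rnd.append(group)
                break
        else:
            rounds.append([group])
    return rounds


def select_group(groups, run_skeleton):
    """Return the candidate group with the lowest skeleton execution time."""
    times = {}
    for rnd in disjoint_rounds(groups):
        # Groups in one round share no node, so they could be timed side by
        # side; this sketch simply runs them one after another.
        for group in rnd:
            times[tuple(group)] = run_skeleton(group)
    return min(times, key=times.get)


groups = [("a", "b"), ("b", "c"), ("c", "d")]
measured = {("a", "b"): 2.5, ("b", "c"): 1.9, ("c", "d"): 3.1}  # seconds, made up
best = select_group(groups, lambda g: measured[tuple(g)])
```

Groups that share no node may still share network links, so running a round concurrently is only safe when the topology makes link sharing unlikely.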



Input: A connected network status graph G. An application pattern graph A 
with m compute nodes representing the number of nodes needed by the 
application and the application communication pattern. That is, there is an 
edge between a pair of graph nodes in A if the link between the 
corresponding application nodes is included in the main application 
communication pattern. Assume that the number of compute nodes in G is at 
least m. 

Output: A graph M containing m nodes that represents a mapping of the 
application to the compute nodes that maximizes the minimum bandwidth 
between any pair of communicating nodes as represented in A. 

1. M = null 

2. Attempt to find a subgraph newM of G such that there is a path between 
   any two nodes of newM if there is an edge between the corresponding nodes 
   of A. If no such graph exists, set newM = null. 

3. If (newM = null) 
       return (M) 
   Else 
       M = newM 

4. Remove the edge with the minimum available bandwidth from G. 
   Goto Step 2. 

Figure 1. Algorithm to select a set of nodes in a network status graph in 
order to maximize the minimum available bandwidth between any pair of 
communicating nodes based on a given application communication pattern 
graph. 
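
The edge-removal procedure of Figure 1 can be sketched in Python under two simplifying assumptions that are not part of the figure: every pair of compute nodes is connected by at most one direct edge in the status graph (so a "path" is just an edge, with no routers in between), and the feasibility test in step 2 is done by brute force over all orderings of m nodes, which is only practical for small clusters. The function names and the sample bandwidths are hypothetical.

```python
from itertools import permutations


def find_mapping(nodes, edges, pattern, m):
    """Return m nodes such that every pattern edge maps onto a remaining edge."""
    for candidate in permutations(nodes, m):
        if all(frozenset((candidate[i], candidate[j])) in edges for i, j in pattern):
            return candidate
    return None


def select_nodes(bandwidth, pattern, m):
    """bandwidth: {(u, v): available bandwidth}; pattern: edges over 0..m-1."""
    edges = {frozenset(pair): bw for pair, bw in bandwidth.items()}
    nodes = sorted({n for pair in edges for n in pair})
    best = None
    while True:
        mapping = find_mapping(nodes, edges, pattern, m)   # step 2
        if mapping is None:
            return best                                    # step 3: last feasible M
        best = mapping
        if not edges:
            return best
        del edges[min(edges, key=edges.get)]               # step 4: drop weakest edge


# Available bandwidth in Mbps between four compute nodes (made-up values).
bandwidth = {
    ("a", "b"): 90, ("a", "c"): 20, ("a", "d"): 50,
    ("b", "c"): 80, ("b", "d"): 60, ("c", "d"): 70,
}
ring = [(0, 1), (1, 2), (2, 0)]      # three nodes communicating in a ring
chain = [(0, 1), (1, 2)]             # three nodes communicating in a line
ring_nodes = select_nodes(bandwidth, ring, 3)
chain_nodes = select_nodes(bandwidth, chain, 3)
```

For the ring the slowest link of {b, c, d} is 60 Mbps, better than any other triangle, while for the chain the two links a-b and b-c suffice, so {a, b, c} wins with a slowest link of 80 Mbps. The same status graph therefore yields different node groups for different communication patterns.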



3 Experiments and results 

A prototype of the node selection framework discussed 
in this paper was implemented and validated on a network 
testbed. We first describe the experiments performed and 
then discuss the results. 

3.1 Experimental setup 

The testbed for the experiments is a compute cluster 
composed of 10 Intel Xeon dual CPU 1.7 GHz machines connected by 100 Mbps 
Ethernet links and a full crossbar switch. All experimental results are 
based on the MPI implementation of the NAS Parallel Benchmarks [2, 16]. The 
codes used are BT (Block Tridiagonal solver), CG (Conjugate Gradient), IS 
(Integer Sort), LU (LU Solver), MG (Multigrid) and EP (Embarrassingly 
Parallel). All programs are compiled using the GNU g77 (Fortran) compiler 
except IS, which is compiled with the gcc (C) compiler. The MPICH 
implementation of MPI is used. The bandwidth between computation nodes was 
managed with the Linux advanced networking toolset iproute2 [1] in order to 
simulate limited bandwidth availability due to competing network traffic. 
iproute2 works by intercepting the network packets and passing them through 
artificial queues to simulate bandwidth limitations. 

3.2 Building communication patterns and communication skeletons 

In order to make an application "ready" for automatic node selection, the 
main communication patterns have to be discovered and a communication 
skeleton program has to be created in a preprocessing phase. For the NAS 
benchmark programs included in this study, the basic communication patterns 
were derived by execution on a dedicated testbed with system level 
monitoring of network traffic. We will skip the details of these 
measurements but they are discussed in [15, 17]. The results are illustrated 
in Figure 2. 

Figure 2. Dominant communication patterns during execution of NAS 
benchmarks. The thickness of the lines reflects the generated communication 
bandwidth. 

The next objective is to construct the communication skeletons. The NAS 
benchmarks are available in several sizes labeled Class S, W, A, B, and C, 
in increasing order of the size of data structures and execution time. Class 
S benchmarks run within a few seconds on a small cluster, while Class C 
benchmarks require a fairly large system to run at an acceptable speed. We 
chose class A benchmarks 
as the target applications to optimize. We also chose the 
corresponding class S benchmarks as the communication 
skeletons for the class A benchmarks, since they closely re- 
semble each other and are likely to be very good skeleton 
programs. Our longer term goal in this research is to auto- 
matically construct performance skeletons. So, clearly, we 
are "cheating" by simply using a good skeleton program 
that happens to be available in this case. The reason is that 
we did not want the results to be impacted by the quality of 
the skeletons that we constructed since automatically build- 
ing good skeletons is an open research problem that is not 
the focus of this paper. Hence, the results we obtained could 
be labeled as optimistic. However, based on other ongoing 
research, it is our firm belief that, in the near future, it will 
be possible to automatically generate skeletons of the qual- 
ity that we have used for our experiments. 

3.3 Automatic node selection 

In order to evaluate node selection in the presence of 
network traffic, experiments were performed with varying 
available bandwidth caused by simulated network traffic. 
The available bandwidth on the network links connecting 
the computation nodes was controlled in the following manner. At any given 
time, every network link was assumed to be shared by a varying number of 
other traffic streams. If S streams are already sharing a network link, the 
bandwidth available to our application with fair sharing is assumed to be 
1/(S+1) of the link capacity. Every 30 seconds, one traffic stream would 
randomly enter or leave the system, with a resultant increase or decrease 
in the available bandwidth on the affected link. The bandwidth, however, 
was never allowed to go below 10 Mbps and could not exceed the link capacity 
of 100 Mbps. Based on the above simulation model, the actual bandwidth was 
controlled with the iproute2 toolset. 
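
The traffic model can be sketched as follows. The sketch covers only the bookkeeping of the model (stream counts and the resulting fair-share bandwidth); applying the limits with iproute2 is outside its scope. The function names, the link labels and the choice of a uniformly random link and direction at each tick are assumptions, since the text does not state how the affected link was drawn.

```python
import random

LINK_CAPACITY_MBPS = 100.0
MIN_BANDWIDTH_MBPS = 10.0
# With 9 competing streams the fair share is exactly 10 Mbps, the allowed floor.
MAX_STREAMS = int(LINK_CAPACITY_MBPS / MIN_BANDWIDTH_MBPS) - 1


def available_bandwidth(streams):
    """Fair share left for the application when `streams` others use the link."""
    share = LINK_CAPACITY_MBPS / (streams + 1)
    return max(MIN_BANDWIDTH_MBPS, min(LINK_CAPACITY_MBPS, share))


def step(streams, rng):
    """One 30-second tick: a stream enters or leaves one randomly chosen link."""
    link = rng.choice(sorted(streams))
    change = rng.choice((-1, 1))
    streams[link] = min(MAX_STREAMS, max(0, streams[link] + change))
    return link


rng = random.Random(0)
streams = {("n0", "n1"): 0, ("n0", "n2"): 3, ("n1", "n2"): 9}
for _ in range(20):                      # simulate 20 ticks, i.e. 10 minutes
    step(streams, rng)
bandwidths = {link: available_bandwidth(s) for link, s in streams.items()}
```

Clamping the stream count to the range 0 to 9 is what keeps the simulated bandwidth between 10 and 100 Mbps.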

Each NAS benchmark program was executed repeatedly 
on 4 nodes selected by our prototype node selection module 
based on the framework presented. NWS was employed to 
measure the available bandwidth between pairs of compute 
nodes on the network and build a network status graph. The 
node selection algorithm presented in Figure 1 was used to select the best 
three groups of nodes every time a benchmark program had to be scheduled and 
executed. Subsequently, the corresponding communication skeleton was 
executed on each of the three groups of nodes, and the group on which it 
performed the best was the selected node group. The benchmark program was 
then executed on those nodes and the execution time was measured and 
compared to the execution time on a dedicated testbed. For comparison, the 
procedure was also performed with two other node selection methods. The 
three node selection procedures that were evaluated and compared against 
each other are as follows: 

1. Pattern based: The framework presented in this pa- 
per. 

2. All-all: The nodes were selected using the network information, on the 
basis of maximizing the minimum available bandwidth between any pair of 
selected nodes. This approach requires a detailed analysis of the network 
status graph, but does not use any application specific information such as 
the communication pattern, and does not employ communication skeletons. 

3. Random: Nodes were selected at random for refer- 
ence. 

The performance achieved by the benchmark programs on nodes selected by each 
of these methods was measured. The experiments were repeated a large number 
of times to get statistically meaningful results. For each benchmark 
program, the average execution time with each node selection procedure was 
computed and compared to the execution time of the same benchmark on a 
dedicated testbed with full bandwidth available on all links. The average 
slowdown due to link sharing for each benchmark program and each node 
selection procedure is presented in Figure 3. 

3.4 Results 

We observe from Figure 3 that each benchmark program 
performs significantly better when pattern based node se- 
lection is employed as compared to random node selection. 
On average, the percentage slowdown with random node
selection is 40%, while that with pattern based node selection
is around 20.2%. Hence, under these particular simulated
conditions, the slowdown due to competing network traffic
is reduced by half with good, application specific, node
selection.

The node selection procedure labeled "all-all" can be
considered a state of the art approach to node selection, but
one that does not use the new concepts introduced in this
paper. In the all-all method, the basis for node selection is
maximizing the minimum available bandwidth between every
pair of selected nodes, as described in [14]. The method
entails a detailed analysis of the network status graph, but
no consideration is given to the application communication
structure, and communication skeletons are not used. The
average slowdown with all-all node selection is 27.6% versus
20.2% for the pattern based framework. Hence, in our
experiments, the pattern based approach to node selection
reduces the slowdown due to link sharing by roughly a quarter
as compared to a good approach that does not consider
application communication patterns.
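
The "half" and "roughly a quarter" figures follow directly from the reported average slowdowns (40% for random, 27.6% for all-all, 20.2% for pattern based). The short check below only restates that arithmetic; the function name is ours.

```python
def relative_reduction(baseline, improved):
    """Fraction by which the slowdown drops when moving from baseline to improved."""
    return (baseline - improved) / baseline


# Average percentage slowdowns reported in the text.
random_vs_pattern = relative_reduction(40.0, 20.2)
all_all_vs_pattern = relative_reduction(27.6, 20.2)

print(f"random  -> pattern based: {random_vs_pattern:.1%}")
print(f"all-all -> pattern based: {all_all_vs_pattern:.1%}")
```

The first ratio is just under one half and the second is a little over one quarter, which matches the wording used in the text.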

We observe that the general percentage slowdown as
well as the relative performance with different node selection
procedures varies dramatically across the programs in
the NAS benchmark suite. The EP benchmark is not included in
the graph as it has no communication, and hence its performance
is unaffected by the changes in available bandwidth
and does not depend on the node selection procedure employed.
The CG and IS benchmarks show the greatest percentage
increase in execution time with random node selection.
We observe from Figure 2 that these benchmarks are
the most bandwidth hungry of the suite, which is the reason
they are most affected by a reduction in the available
bandwidth.

The pattern based scheme performs better than the random
and all-all schemes for every application but there are
significant differences. The maximum improvement in performance
is observed for the CG benchmark. We speculate
that the reason is that only 3 pairs of nodes communicate
in CG as shown in Figure 2. A smart node selection procedure
has a better chance of finding a relatively small number
of "good" network paths as compared to finding good
paths between every pair of selected nodes. This translates
to finding 3 good paths versus 6 good paths for 4 nodes.
Further, as mentioned earlier, CG is among the most
communication intensive programs in the suite, and therefore,
its performance is most sensitive to network path selection.

Another observation is that the relative improvement
with pattern based node selection, as compared to all-all
node selection, is lowest for the IS and BT benchmarks. This is
not surprising since the main communication pattern in the IS
and BT benchmarks is an all to all data exchange. Hence,
the analysis of the network status graph is identical for pattern
based and all-all procedures and the difference is only due
to the use of communication skeletons in the pattern based
framework.

The broad conclusion from these experiments is that the
pattern based approach to node selection offers considerable
improvement in expected performance over random node
selection and all-all node selection procedures, but the extent
of improvement is strongly dependent on the application
characteristics. We should also caution that these are
limited experiments and the results will also strongly depend
on the network and system characteristics.

4 Discussion 

This research employs application characteristics to 
drive the process of automatically selecting network nodes 
to execute an application. Specifically, we suggest a two 
step process for node selection. In the first step, a network 
status map is matched to the application communication 
structure to obtain a set of potential node groups for execution.
Clearly, better node selection decisions can be made if
the procedure is sensitive to application characteristics. In 
the second step, an application communication skeleton is 
executed on every candidate group of nodes to decide which 
group offers the best potential performance. This step is in- 
tended to eliminate the various inaccuracies in estimating 
the communication performance an application can expect 
on a group of network nodes. 
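
The two-step process can be sketched as follows. This is a minimal illustration, not the prototype's code: `estimate` stands in for the match between the network status map and the communication structure, `run_skeleton` stands in for timing the communication skeleton on a group, and the groups, scores, and timings are hypothetical values. The shortlist of three groups follows the experiments described in this paper.

```python
def select_node_group(candidate_groups, estimate, run_skeleton, shortlist=3):
    """Two-step node selection.

    Step 1: rank candidate groups by an estimate derived from network
            status information (higher is better) and keep the best few.
    Step 2: execute the communication skeleton on each shortlisted group
            and pick the group with the shortest measured time.
    """
    ranked = sorted(candidate_groups, key=estimate, reverse=True)
    finalists = ranked[:shortlist]
    return min(finalists, key=run_skeleton)


# Hypothetical inputs: estimated quality and measured skeleton time (seconds).
groups = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
estimated_score = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7, ("c", "d"): 0.2}
skeleton_seconds = {("a", "b"): 1.4, ("a", "c"): 1.1, ("b", "c"): 1.3, ("c", "d"): 0.5}

chosen = select_node_group(groups, estimated_score.get, skeleton_seconds.get)
print(chosen)
```

Note that the skeleton is only executed on the shortlisted groups, so a group that the first step ranks poorly is never measured, even if it would have run the skeleton faster.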

The focus of this paper has been entirely on communica- 
tion characteristics. We only consider variations in network 
availability and base node selection on communication
capacity. In practice, computation and synchronization
considerations are equally important. Computation nodes may
have competing loads, and the impact of a slowdown in one
node in the system may get magnified because of data and
control dependencies. In related work, we have addressed
the problem of performance estimation with shared nodes
and links [17]. However, effective integration and validation
of these techniques in a node selection system remains
a challenge.

We have only employed communication skeletons, which
are a special case of performance skeletons. Construction
of complete performance skeletons also includes computation
and synchronization considerations. Perhaps the most
critical limitation of our system is that the communication
skeletons have to be constructed manually. It is not
difficult to automatically construct a program that concisely
reproduces the measured communication pattern of an ap- 
plication. However, our real goal is automatic construction 
of general performance skeletons. A performance skeleton 
should mirror the application it represents, in all respects. 
For example, the computation to communication ratio, syn- 
chronization patterns, memory access patterns, fraction of 
different types of instructions, message exchange patterns, 
should all be closely correlated between an application and 
its performance skeleton. The goal is that the relative
behavior of the application and performance skeleton should
be similar under any computation environment and under
all network conditions. And yet the performance skeleton
is expected to execute for a very short time. Note that a
performance skeleton cannot be just the beginning part of the
application itself since application behavior changes over
time and the performance skeleton is expected to capture
the cumulative application activities over the full duration
of execution. Clearly, automatically constructing performance
skeletons is a major challenge, and it is also a key
long-term goal of this research. This paper focuses only on 
demonstrating the value of performance skeletons for appli- 
cation scheduling. 

The prototype implementation and results described in
this work are essentially a "proof of concept". We point out
the most significant limitations of our implementation and
experiments. The prototype node selection tool automatically
determines the best nodes for execution and schedules
the application on those nodes. However, some of the steps
in the preprocessing of the applications, to enable them for
automatic node selection, are manual. We have conducted
experiments on a small compute cluster with the bandwidth
controlled to simulate network sharing. More experimentation
on larger clusters and grid environments is necessary
to evaluate this approach rigorously. Our system currently
works only for MPI message passing applications but
is not fundamentally limited to any programming model.
The NAS benchmark programs used in this research represent
a variety of applications in parallel computing, but
each benchmark focuses on a single core scientific algorithm.
Full applications, in contrast, often employ multiple
different types of computations in different phases. This
certainly adds complexity to node selection that
is not evaluated in this work. Overall, we believe that our
results are relevant and meaningful, even though there is
significant room for more experimentation and better
evaluation and validation.

5 Conclusions

This paper makes a case for employing application 
knowledge to node selection in shared cluster and grid en- 
vironments. We demonstrate how the communication pat- 
tern of an application is exploited to discover good compute 
nodes and network paths for execution. One of the major
problems in automatic node selection for network environments
is the cost and accuracy of network usage information.
We propose application communication skeletons as
our solution approach. With the use of this method, approximate
network information is used to get good candidate
node groups for execution, and actual execution of skeletons 
is used to make the final choice. This largely eliminates the 
potential for poor choices due to inaccurate network infor- 
mation since a small slice of actual execution is performed 
before assignment of nodes to an application. 

We have developed a prototype node selection framework
and present results from a network testbed that simulates
varying bandwidth availability between pairs of nodes
available for execution. The results clearly demonstrate that
node selection based on this framework is a large improvement
over random node selection, and also a clear improvement
over state of the art methods that do not employ
application knowledge. While our prototype implementation and
experiments are limited in scope, they clearly demonstrate
the potential of automated node selection and scheduling
based on an application's communication pattern. This paper
is a significant step towards general resource scheduling
that employs broader application knowledge including
computation and synchronization information and general
performance skeletons.

Abstract

Job Management Systems (JMSs) efficiently schedule and 
monitor jobs in parallel and distributed computing 
environments. Therefore, they are critical for improving 
the utilization of expensive resources in high-performance 
computing systems and centers, and an important 
component of grid software infrastructure. With many 
JMSs available commercially and in the public domain, it 
is difficult to choose an optimum JMS for a given
computing environment. In this paper, we present the
results of the first empirical study of JMSs reported in the
literature. Four commonly used systems, LSF, PBS Pro,
Sun Grid Engine / CODINE, and Condor were
considered. The study has revealed important strengths
and weaknesses of these JMSs under different operational
conditions. For example, LSF was shown to exhibit
excellent throughput for a wide range of job types and
submission rates. On the other hand, CODINE appeared
to outperform other systems in terms of the average
turn-around time for small jobs, and PBS appeared to excel in
terms of turn-around time for relatively larger jobs.

1. Introduction 

A lot of work has been done in grid software
infrastructure. One of the major tasks of this infrastructure
is job management, also known as workload management,
load sharing, or load management. Software systems
capable of performing this task are referred to as Job
Management Systems (JMSs).

Job Management Systems can leverage under-utilized
computing resources in a grid-computing-like style. Most
JMSs can operate in multiple environments, including 
heterogeneous clusters of workstations, supercomputers, 
and massively parallel systems. The focus of our study is 
performance of JMSs in a loosely coupled cluster of 
heterogeneous workstations. 

Taking into account the large number of JMSs
available commercially and in the public domain, choosing
the best JMS for a particular type of distributed computing
environment is not an easy task. All previous comparisons
of JMSs reported in the literature had only a conceptual
character. In [1], selected JMSs were compared and
contrasted according to a set of well defined criteria.
In [2, 3, 4], the job management requirements for the
Numerical Aerodynamic Simulation (NAS) parallel
systems and clusters at NASA Ames Research Center
were analyzed and several commonly used JMSs
evaluated according to these criteria. In [5, 6, 7], three
widely used JMSs were analyzed from the point of view
of their use with Sun HPC Cluster Tools. Finally, our
earlier conceptual study, reported in [7, 8, 9], gave a
comparative overview and ranking of twelve popular
systems for distributed computing, including several
JMSs.

In this paper, we extend the conceptual comparison
with the empirical study based on a set of well defined
experiments performed in a uniform fashion in a
controlled computing environment. To our best
knowledge, this is a first reported experimental study
quantifying the relative performance of several Job
Management Systems.

Our paper is organized as follows. In Section 2, we 
give an introduction to Job Management Systems, and 
summarize conceptual functional differences among them. 
In Section 3, we define metrics used for comparison, 
present our experimental setup, and discuss parameters 
and role of all experiments. In Section 4, we describe our 
methodology and tools used for the measurement 
collection. Finally, in Sections 5 and 6, we present
experimental results, their analysis, and we draw
conclusions regarding the relative strengths and
weaknesses of investigated JMSs.

2. Job Management Systems 

2.1. General architecture of a JMS 

The objective of a JMS, for an environment
investigated in this paper, is to let users execute jobs on a
non-dedicated cluster of workstations with a minimum 
impact on owners of these workstations by using 
computational resources that can be spared by the owners. 
The system should be able to perform at least the 
following tasks: 

a. monitor all available resources,

b. accept jobs submitted by users together with resource
requirements for each job,

c. perform centralized job scheduling that matches all
available resources with all submitted jobs according to
the predefined policies [10],

d. allocate resources and initiate job execution,

e. monitor all jobs and collect accounting information.

Figure 1. Major functional blocks of a Job
Management System

To perform these basic tasks, a JMS must include at
least the following major functional units shown in
Fig. 1:

1. User server - which lets the user submit jobs and their
requirements to a JMS (task b), and additionally may
allow the user to inquire about the status and change
the status of a job (e.g., to suspend or terminate it).

2. Job scheduler - which performs job scheduling and
queuing based on the resource requirements, resource
availability, and scheduling policies (task c).

3. Resource manager - used to monitor resources and
dispatch jobs on a given execution host (tasks a, d, e).

2.2. Choice of Job Management Systems 

More than twenty JMS packages, both commercial and
public domain, are currently in use [1, 7]. In the interest of
time, we selected four representative and commonly used
JMSs:

• LSF - Load Sharing Facility

• PBS - Portable Batch System

• Sun Grid Engine / CODINE, and

• Condor

The common feature of these JMSs is that all of them are 
based on a central Job Scheduler running on a single 
node. 

LSF (Load Sharing Facility) is a commercial JMS from 
Platform Computing Corp. [11,12]. It evolved from the
Utopia system developed at the University of Toronto
[13], and is currently probably the most widely used JMS.

PBS (Portable Batch System) has both a public domain
and a commercial version [14]. The commercial version
called PBS Pro is supported by Veridian Systems. This
version was used in our experiments. PBS was originally
developed to manage aerospace computing resources at
NASA Ames Research Center.

Sun Grid Engine/CODINE is an open source package
supported by Sun Inc. It evolved from DQS (Distributed
Queuing System) developed by Florida State University.
Its commercial version called CODINE was offered by
GENIAS GmbH in Germany and became widely deployed
in Europe.

Condor is a public domain software package that was
started at the University of Wisconsin. It was one of the first
systems that utilized idle workstation cycles and supported 
checkpointing and process migration. 

2.3. Functional similarities and differences 
among selected Job Management Systems 

The most important functional characteristics of the
four selected JMSs are presented and contrasted in Table
1. From this table, it can be seen that LSF supports all
operating systems, job types, and features included in the
table. CODINE lacks support for Windows NT, stage-in
and stage-out, and checkpointing. PBS and Condor trail
LSF and CODINE in terms of support for parallel jobs,
dynamic load balancing and master daemon fault
recovery. They also support a smaller number of operating
systems compared to LSF.

3. Experimental Setup 

3.1. Metric 

The following performance measures were investigated 
in our study: 

1. Throughput is defined in general as a number of jobs
completed in a unit of time. Since this number depends
strongly on how many jobs are taken into account, we
consider throughput to be a function of the number of
jobs, k, and define it as k divided by the amount of time
necessary to complete k JMS jobs (see Fig. 2a). We also
define total throughput as a special case of throughput for
parameter k equal to the total number of jobs submitted to
a JMS during the experiment (see Fig. 2b).

In Fig. 3, we show the typical dependence of
throughput on the number of jobs taken into account, k. It
can be seen that throughput increases sharply as a function
of k until the moment when either all system CPUs
become busy, or the number of jobs submitted and
completed in a unit of time become equal. When the



Table 1. Conceptual functional comparison of selected Job Management Systems

                          LSF                   CODINE                PBS                   Condor

Distribution              commercial            public domain         commercial and        public domain
                                                                      public domain

Operating System Support
  Linux, Solaris          yes                   yes                   yes                   yes
  Tru64                   yes                   yes                   yes                   no
  Windows NT              yes                   no                    no                    partial

Types of Jobs
  Interactive jobs        yes                   yes                   yes                   no
  Parallel jobs           yes                   yes                   partial               limited to PVM

Features Supporting Efficiency, Utilization, and Fault Tolerance
  Stage-in and stage-out  yes                   no                    yes                   yes
  Process migration       yes                   yes                   no                    yes
  Dynamic load balancing  yes                   yes                   no                    no
  Checkpointing           yes                   using external        only kernel-level     yes
                                                libraries
  Daemon fault recovery   master and execution  master and execution  only for execution    only for execution
                          hosts                 hosts                 hosts                 hosts

Figure 2. Definition of (a) throughput and (b) total
throughput

number of jobs taken into account, k, gets close to the
total number of jobs submitted during the experiment, the
throughput drops sharply and unpredictably. This drop is
the result of a boundary effect and is not likely to appear
during the regular operation of a JMS, when the flow of
jobs submitted to the JMS continues uninterrupted for a
long period of time. Therefore, we decided to use for
comparison




Figure 3. Throughput as a function of number of jobs
taken into account

Figure 4. Definition of timing parameters

Figure 5. Definition of the system utilization and its
measurement using top

average throughput, defined as throughput averaged over
all possible values of the job number, k.

2. Average turn-around time is the time from submitting
a job till completing it, averaged over all jobs submitted to
a JMS (see Fig. 4).

3. Average response time is the average amount of time 
between submitting a job to a JMS and starting the job on 
one of the execution hosts (see Fig. 4). 

4. Utilization is the ratio of a busy time span to the
available time span. In our experiments, we measured the
utilization by measuring the average percentage of the
CPU time used by all JMS jobs on each execution host.
These average machine utilizations were then averaged
over all execution hosts (see Fig. 5).
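
The timing-based measures 1 to 3 can be computed from three timestamps per job. The sketch below is an illustration under stated assumptions, not the post-processing scripts used in the study: the `JobRecord` type, the function names, and the sample values are ours, and time is assumed to be counted from the start of the experiment, `t0`, since Fig. 2 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    submit: float  # time the job was submitted to the JMS (seconds)
    start: float   # time the job started on an execution host
    end: float     # time the job completed


def throughput(jobs, k, t0):
    """k divided by the time needed to complete k jobs, counted from t0."""
    finish_times = sorted(job.end for job in jobs)
    return k / (finish_times[k - 1] - t0)


def average_throughput(jobs, t0):
    """Throughput averaged over all values of k from 1 to the number of jobs."""
    n = len(jobs)
    return sum(throughput(jobs, k, t0) for k in range(1, n + 1)) / n


def average_turnaround(jobs):
    """Mean time from submission to completion."""
    return sum(job.end - job.submit for job in jobs) / len(jobs)


def average_response(jobs):
    """Mean time from submission to start of execution."""
    return sum(job.start - job.submit for job in jobs) / len(jobs)


# Hypothetical records for three jobs; the experiment starts at t0 = 0.
jobs = [JobRecord(0.0, 1.0, 11.0), JobRecord(5.0, 5.0, 15.0), JobRecord(10.0, 12.0, 30.0)]

total = throughput(jobs, len(jobs), 0.0)  # total throughput: k equals all jobs
print(total, average_throughput(jobs, 0.0))
print(average_turnaround(jobs), average_response(jobs))
```

With k equal to the number of submitted jobs, `throughput` gives the total throughput; averaging over all k gives the average throughput used for comparison.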

3.2. Our Micro-grid testbed 

A Micro-grid testbed used in our experiments is shown 
in Fig. 6. The testbed consists of 9 PCs running Linux OS, 
and 4 workstations Ultra 5, running Solaris 8. 
Figure 6. A Micro-grid testbed used in the
experimental study



The total number of CPUs available in the testbed is 20.
The network structure of the testbed is flat, so that every
machine can serve as both an execution host and a
submission host. In all our experiments, pallj was used
as a master host for all Job Management Systems. All 13
hosts, including the master host, were configured as
execution hosts. In all our experiments, pallj was also
employed as a submission host.

3.3. Application benchmarks

A set of 36 benchmarks has been compiled and
installed on all machines of our testbed. These programs
belong to the following four classes of benchmarks: NSA
HPC Benchmarks, NAS Parallel Benchmarks, UPC
Benchmarks, and Cryptographic Benchmarks. Each
benchmark has been characterized in terms of the CPU
time, wall time, and memory requirements using one of
the Linux machines.

All benchmarks have been divided into the following 
three sets of benchmarks: 

1. Set 1 - Short job list - 16 benchmarks with an
execution time between 1 second and 2 minutes, and
an average execution time equal to 22 seconds.

2. Set 2 - Medium job list - 8 benchmarks with an
execution time between 2 minutes and 10 minutes, and
an average execution time equal to 7 minutes 22
seconds.

3. Set 3 - Long job list - 6 benchmarks with an execution
time between 10 minutes and 31 minutes, and an
average execution time equal to 16 minutes 51
seconds.

3.4. Experiments 

Each experiment consists of running N jobs chosen
pseudo-randomly from the given set of benchmarks, and
submitted one at a time to a given JMS at pseudo-random
time intervals. All jobs were submitted from the
same machine, pallj, and belonged to a single user of
the system. The rate of the job submissions was chosen to
have a Poisson distribution.
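
A submission schedule of this kind can be sketched as follows. The sketch assumes that submissions form a Poisson process, so the gaps between consecutive submissions are drawn from an exponential distribution with the chosen mean interval. The function name, the benchmark names, and the seed are hypothetical; this is not the C++ submission program used in the study.

```python
import random


def submission_schedule(benchmarks, n_jobs, mean_interval, seed):
    """Return a list of (submission_time, benchmark) pairs.

    Jobs are drawn pseudo-randomly from the benchmark list, and the
    gaps between submissions are exponentially distributed with the
    given mean, which yields a Poisson arrival process.
    """
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    now = 0.0
    schedule = []
    for _ in range(n_jobs):
        now += rng.expovariate(1.0 / mean_interval)
        schedule.append((now, rng.choice(benchmarks)))
    return schedule


# Hypothetical benchmark names; 150 jobs with a 15 s mean interval.
benchmarks = ["bench_a", "bench_b", "bench_c"]
schedule = submission_schedule(benchmarks, n_jobs=150, mean_interval=15.0, seed=1)
print(schedule[:3])
```

Reusing the same seed reproduces the same sequence of jobs and submission times, which is what allows different systems to be compared under identical load.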

The only job requirement specified during the job 
submission was the amount of available memory. No 
information about the expected execution time, or limits 
on the wall or CPU time were specified. 

The total number of jobs submitted to a system, N, was
chosen based on the expected total time of each
experiment, the average execution time of jobs from the
given list, and the number of machines in our testbed. In
Experiments 1, 3, 4, and 5, regarding short and medium
job lists, the total number of jobs was set to 150, which
led to a total experiment time of about two hours. In



Table 2. Characteristic features of the experiments

Experiment  Benchmark Set            Average CPU   Average Time Intervals      Total Number  Special Assumptions
Number                               time / job    Between Job Submissions     of Jobs

1           Set 2, Medium job list   7 min 22 s    30 s, 15 s, 5 s             150           one job / CPU
2           Set 3, Long job list     16 min 51 s   2 min, 30 s                 75            one job / CPU
3           Set 1, Short job list    22 s          15 s, 10 s, 5 s, 2 s, 1 s   150           one job / CPU
4           Set 2, Medium job list   7 min 22 s    15 s                        150           two jobs / CPU
5           Set 1, Short job list    22 s          5 s                         150           one job / CPU;
                                                                                             emulation of daemon
                                                                                             faults


Experiment 2, regarding the long job list, the total
number of jobs was reduced to 75 to keep the time of
each experiment within the range of 2 hours.

Each experiment was repeated for four JMSs, under
exactly the same initial conditions, including the same
initial seeds of the pseudo-random generators.
Additionally, all experiments were repeated 3 times for
the same JMS to minimize the effects of random events
in all machines participating in the experiment.

Additionally, each experiment was repeated for
several different average job submission rates. These
rates have been chosen experimentally in such a way that
they correspond to qualitatively different JMS loads. For
the smallest submission rate, each system is very lightly
loaded. Only a subset of all available CPUs is utilized at
any point in time. Any new job submitted to the system
can be immediately dispatched to one of the execution
hosts. For the highest submission rate, a majority of
CPUs are busy all the time, and almost any new job
submitted to a JMS must spend some time in a queue
before being dispatched to one of the execution hosts.

The characteristic features of five experiments 
performed during our experimental study are 
summarized in Table 2. Experiments 1-4 were designed 
to measure the performance of each JMS for different
job submission rates. 

Experiment 5 was aimed at quantifying fault 
tolerance of each JMS by determining its resistance 
against the master and execution daemon failures. Five 
minutes after the beginning of this experiment, master 
daemons on a master host or execution host daemons on 
a single execution host were killed. In one version of the 
experiment, the killed daemons were restarted one
minute later; in the other version, no further action was
taken. In all cases, the total number of jobs that
completed execution was recorded and compared with 
the total number of jobs submitted to the system for 
execution. 



3.5 Common settings of Job Management 
Systems 

An attempt was made to set all JMSs to an
equivalent configuration, using the following major
configuration settings:
A. Maximum Number of Jobs per CPU

In all experiments, except Experiment 4, a maximum
number of jobs assigned simultaneously to each CPU
was set to one. In other words, no timesharing of CPUs
was allowed. This setting was chosen as an optimum
because of the numerical character of benchmarks used
in our study. All benchmarks from the short, medium,
and long job lists have no input or output. For this kind
of benchmarks, timesharing can improve only the
response time, but has a negative effect on the two most
important performance parameters: turn-around time and
throughput. This deteriorating effect of timesharing was
clearly demonstrated in our Experiment 4, where two
jobs were allowed to share the same CPU.
B. CPU factor of execution hosts

The CPU factors determine the relative performance
of execution hosts for a given type of load. Based on the
recommendations given in the JMS manuals, CPU
factors for LSF and CODINE were set based on the
relative performance of benchmarks representing a
typical load. For each list of benchmarks, two
representative benchmarks were selected, and run on all
machines of distinctly different types. The CPU factors
were set based on an average ratio of the execution time
on the slowest machine to the execution time on the
machine for which the CPU factor was determined.
Based on this procedure, the slowest machine always had
a CPU factor equal to 1.0. The CPU factors of the
remaining machines varied in the range from 1.2 to 1.7
for the short job list, and from 1.4 to 1.95 for the medium
and long job lists. The CPU factors of Condor were
computed automatically by this JMS based on the
Condor-specific benchmarks running on the execution
hosts in the spare time. The CPU factors in LSF,
CODINE, and Condor affect the operation of the
scheduler. In PBS, the equivalent parameter has no
effect on scheduling, and affects only accounting and
time limit enforcement.
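
The manual procedure for deriving CPU factors can be illustrated with a short sketch. It assumes that the slowest machine is the one with the largest total execution time over the representative benchmarks; the machine names, benchmark names, and timings are hypothetical.

```python
def cpu_factors(exec_times):
    """Compute CPU factors from measured execution times.

    exec_times maps machine -> {benchmark: execution time in seconds}.
    The factor of a machine is the average, over the benchmarks, of
    (time on the slowest machine) / (time on this machine), so the
    slowest machine gets a factor of 1.0.
    """
    # Assumption: "slowest" means the largest total time over all benchmarks.
    slowest = max(exec_times, key=lambda m: sum(exec_times[m].values()))
    reference = exec_times[slowest]
    factors = {}
    for machine, times in exec_times.items():
        ratios = [reference[bench] / times[bench] for bench in reference]
        factors[machine] = sum(ratios) / len(ratios)
    return factors


# Hypothetical timings (seconds) for two representative benchmarks.
exec_times = {
    "slow_pc": {"bench_1": 100.0, "bench_2": 200.0},
    "mid_pc": {"bench_1": 80.0, "bench_2": 125.0},
    "fast_ws": {"bench_1": 50.0, "bench_2": 125.0},
}

factors = cpu_factors(exec_times)
print(factors)
```

Here the machine that runs both benchmarks slowest receives a factor of 1.0, and the other two receive 1.425 and 1.8, which fall in the same general range as the factors reported above.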
C. Dispatching interval

The dispatching interval determines how often the
JMS scheduler attempts to dispatch pending jobs. This
parameter clearly affects an average response time, as
well as scheduler overhead. It may also influence the
remaining performance parameters.

LSF on one side, and PBS, CODINE, and Condor on
the other side use a different definition of this parameter.
In all systems, this parameter describes the maximum
time in seconds between subsequent attempts to
schedule jobs. However, in PBS, CODINE, and Condor
the attempts to schedule a job also occur whenever a
new job is submitted, and whenever a running batch job
terminates. The same is not the case for LSF. On the
other hand, LSF has two additional parameters that can
be used to limit the time spent by the job in the queue,
and thus reduce the response time.
F. Scheduling policies 

No changes to the parameters describing scheduling
policies were made, which means that the default First
Come First Serve (FCFS) scheduling policy was used for
all systems. One should, however, be aware that within
this policy, a different ranking of hosts fulfilling the job
requirements might be used by different JMSs.

4. Methodology and measurement collection 

Each experiment was aimed at determining values of 
all performance measures defined in Section 3.1. All 
parameters were measured in the same way for all JMSs,
using utilities and mechanisms of the operating systems
only.

In particular, timestamps generated using the C
function gettimeofday() were used to determine
the exact time of a job submission, as well as the beginning
and end of the execution time. The function
gettimeofday() gets the current time from the
operating system. The time is expressed in seconds and
microseconds elapsed since Jan 1, 1970 00:00 GMT.
The actual resolution of the returned time depends on 
the accuracy of the system clock, which is hardware 
dependent. 
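
For illustration, the same kind of timestamp can be obtained in Python. This is only an analogue of the C call used in the study: `time.time_ns()` returns the time since the same epoch in nanoseconds, which the sketch splits into whole seconds and microseconds. The helper name `timestamp` is ours.

```python
import time


def timestamp():
    """Return (seconds, microseconds) elapsed since Jan 1, 1970 00:00 GMT."""
    nanoseconds = time.time_ns()
    seconds, remainder = divmod(nanoseconds, 1_000_000_000)
    return seconds, remainder // 1_000


# Record a timestamp, e.g. at job submission or at the start of execution.
sec, usec = timestamp()
print(f"{sec}.{usec:06d}")
```

As with gettimeofday(), the values are reported to the microsecond, but the real resolution is limited by the system clock.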

The Unix Network Time Protocol (NTP) was used
to synchronize clocks of all machines of our Micro-grid.
The protocol provides accuracy ranging from
milliseconds (on LANs) to tens of milliseconds (on
WANs).
Figure 7. Software used to collect performance
measures

In order to determine the JMS utilization, the Unix
top utility was used. This utility records an
approximate percentage of the CPU time used by each
process running on a single machine averaged over a
short period of time, e.g., 15 seconds (see Fig. 5). For
each point in time, the sum of percentages corresponding
to all JMS jobs is computed. These sums are then
averaged over the entire duration of an experiment, to
determine an average utilization of each machine by all
JMS jobs. The execution host utilizations averaged over
all execution hosts determine the overall utilization of a
JMS.
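
The averaging described above can be sketched as follows. The input format is an assumption made for illustration: for each host, a list of samples, where each sample lists the CPU percentages that top reported for the JMS jobs running at that moment. The host names and values are hypothetical, and parsing of the top output is not shown.

```python
def host_utilization(samples):
    """Average, over all samples, of the summed CPU percentage of JMS jobs."""
    return sum(sum(sample) for sample in samples) / len(samples)


def overall_utilization(samples_by_host):
    """Average of the per-host utilizations over all execution hosts."""
    per_host = [host_utilization(samples) for samples in samples_by_host.values()]
    return sum(per_host) / len(per_host)


# Hypothetical samples: one inner list per top snapshot, one value per JMS job.
samples_by_host = {
    "host1": [[50.0, 30.0], [100.0], []],  # an empty list means no JMS job was running
    "host2": [[20.0], [40.0], [0.0]],
}

print(host_utilization(samples_by_host["host1"]))
print(overall_utilization(samples_by_host))
```

In this example host1 averages 60% and host2 averages 20%, so the overall utilization is 40%.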

Three programs were developed to support the 
experiments and were used in a way shown in Fig. 7. A
C++ Job Submission program has been written to
emulate a random submission of jobs from a given host.
This program takes as an input a list of jobs, a total
number of submissions, an average interval between two
consecutive submissions, and the name of a JMS used in
a given experiment. Two post-processing Perl scripts,
Timing and Utilization post-processing utilities, have
been developed to process log files generated by 
benchmarks and the top utility. These scripts generate 
exhaustive reports including values of all performance 
measures separately for every execution host, and jointly 
for the entire Micro-Grid testbed. 

5. Experimental Results 

The two most important parameters determining the
performance of a Job Management System are turn-around
time and throughput. Throughput is particularly
important when a user submits a large batch of jobs to a
JMS and does not do any further processing till all jobs
complete execution. Turn-around time is particularly
important when a user tends to work in a pseudo-interactive
mode and awaits results of each subsequent experiment.
Additionally, an average response time might be important
in case of interactive jobs that require user input and thus
a constant presence of the user.

Figure 8. Average throughput and average turn-around time for the medium job list

Figure 9. Average throughput and average turn-around time for the long job list

Figure 10. Average throughput and average turn-around time for the short job list

Figure 11. Average partial throughput and response time for medium jobs with two jobs allowed to share a single CPU

In Experiment 1, with the medium job list, LSF and 
PBS were consistently the best in terms of the average 
throughput, while PBS was the best in terms of the 
average turn-around time. The overall differences among 
all four systems were smaller than 21% for average 
throughput, and 33% for average turn-around time. 

In Experiment 2, with the long job list, the throughputs 
of LSF, Condor, and PBS were almost identical, while 
CODINE was trailing by 29%. The turn-around time was 
the best for PBS and LSF for both investigated 
submission rates. For the higher submission rate, the 
performance of Condor was approximately the same as 
the performance of the two best systems. 

In Experiment 3, for the short job list, the throughputs of 
LSF and CODINE were the highest of all investigated 
systems. At the same time, the turn-around time was 
consistently the best for CODINE. 

The analysis of the system utilization and job 
distribution revealed the following reasons for the 
different relative performance of each system in terms of 
throughput and turn-around time. LSF tends to dispatch 
jobs to all execution hosts, independently of their relative 
speed (see Fig. 12a). It also uses a complex algorithm for 
scheduling, which guarantees that jobs are executed 
tightly one after the other. Both factors contribute to high 
throughput. At the same time, distributing jobs to all 
machines, including slow ones, increases average 
execution time, and complex scheduling affects average 
response time. Both factors contribute to the increase in 
the average turn-around time. On the other hand, PBS 
distributes jobs only to a limited number of the fastest 
execution hosts (see Fig. 12b). As a result, the average 
execution time is smaller compared to LSF, which 
contributes to a better average turn-around time. At the 
same time, the limited utilization of the execution hosts 
contributes to only average throughput. Finally, Condor 
has a scheduling algorithm that seems to be more suitable 
for longer jobs. Although all execution hosts are utilized, 
there are significant gaps between the times 
when one job finishes execution and another job is 
dispatched to the same execution host. 

In Experiment 4, the configuration parameters of each 
system were changed in order to allow two JMS jobs to 
execute on each execution host at the same time. Since all 
jobs used in our study are numerical, and have limited 
input/output, timesharing of jobs could not improve either 
average throughput or average turn-around time. In fact, 
these parameters deteriorated by 7 to 18%, 
depending on the JMS, because of the overhead 
associated with context switching. The only parameter that 
improved was the average response time, which decreased by 
a factor ranging from 1.6 to 1.9. 



In Experiment 5 (see Table 3), LSF was shown to be 
resistant against the master daemon failure. When the 
master daemons were killed, they were automatically 
restarted shortly after. No jobs were lost. PBS and 
Condor do not have the capability to restart killed master 
daemons, but they resume normal operation after these 
daemons are restarted manually. The interruption has a 
limited effect on the number of jobs that complete 
execution. All JMSs appeared to be resistant against the 
failure of execution host daemons. These daemons were 
not automatically restarted, but as a result of their failure, 
affected jobs were redirected to other execution hosts. 

6. Summary and Conclusions 

The summary of the performance of all investigated 
Job Management Systems as a function of the job size and 
the job submission rate is given in Tables 4 and 5. 

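Table 4. JMS ranking in terms of the average 
throughput: summary (B - relatively best, W - relatively worst) 

Job size | Submission rate 
         | Low              | Medium               | High 
---------+------------------+----------------------+------------------ 
Large    | B: LSF, PBS      | B: Condor, LSF, PBS  | 
         | W: CODINE (-14%) | W: CODINE (-29%)     | 
---------+------------------+----------------------+------------------ 
Medium   | B: LSF, PBS      | B: LSF, PBS          | B: PBS, LSF 
         | W: CODINE (-21%) | W: CODINE (-11%)     | W: CODINE (-17%) 
---------+------------------+----------------------+------------------ 
Small    | B: LSF, CODINE   | B: LSF, CODINE       | B: LSF, CODINE 
         | W: Condor (-57%) | W: Condor (-69%)     | W: Condor (-71%) 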



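Table 5. JMS ranking in terms of the average turn-around 
time: summary (B - relatively best, W - relatively worst) 

Job size | Submission rate 
         | Low               | Medium            | High 
---------+-------------------+-------------------+------------------ 
Large    | B: PBS, LSF       | B: PBS, LSF       | 
         | W: Condor, CODINE | W: CODINE (+57%)  | 
         |    (+78%)         |                   | 
---------+-------------------+-------------------+------------------ 
Medium   | B: PBS            | B: PBS            | B: PBS 
         | W: CODINE (+31%)  | W: CODINE (+37%)  | W: CODINE (+33%) 
---------+-------------------+-------------------+------------------ 
Small    | B: CODINE         | B: CODINE         | B: CODINE 
         | W: Condor (+72%)  | W: PBS (+119%)    | W: LSF (+275%) 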



Based on Tables 4 and 5, and Figures 8-11, we can 
draw the following conclusions. For large jobs with 
medium submission rate, Condor compared favorably 
with the rest of the systems. In terms of the average 
system throughput, LSF appears to offer the best 
performance for all job sizes and submission rates. In 
terms of the average turn-around time, PBS is the best for 
large and medium jobs, but CODINE outperforms it for 
short jobs. 

The relative performance of Job Management Systems 
was similar for medium and large jobs, and changed 
considerably for short jobs where the job execution times 
became comparable with the times required for resource 
monitoring and job scheduling. CODINE appeared to be 
particularly efficient for small jobs, while the relative 
performance of PBS and Condor improved with the 
increase in the job size and the job submission rate. LSF 
was the only system that performed uniformly well for all 
job sizes and submission rates with the exception of the 
turn-around time for small jobs and large submission 
rates. 

Despite the limitations resulting from the relatively 
small size of our Micro-Grid testbed and a limited set of 
system settings exercised in our experiments, the practical 
value of our empirical knowledge comes, among others, 
from the following factors: 

• Even though our benchmarks and experiment times 
seem to be relatively short compared to real-life 
scenarios, we make up for that by setting the average 
time between job submissions to relatively small 
values. As a result, the systems are fully exercised, 
and our results are likely to scale for more realistic 
loads with proportionally longer job execution times 
and longer times between job submissions. 

• Typical users rarely use all capabilities of any 
complicated system, such as a JMS. Instead, the majority of 
Job Management Systems deployed in the field use the 
default values of the majority of configuration parameters. 

Additionally, to the best of our knowledge, our study is the 
first empirical study of Job Management Systems reported 
in the literature. Our methodology and tools developed as 
a result of this project may be used by other groups to 
extend the understanding of similarities and differences 
among behavior and performance of existing Job 
Management Systems. 
How Sun™ Grid Engine, Enterprise 
Edition 5.3 Works 



This document shows how Sun™ Grid Engine, Enterprise Edition (SGEEE) software 
policies function to create a productive compute-intensive environment. It explains 
how SGEEE software provides solutions for executives and technical staff involved 
in implementing projects that require the delivery of abundant compute power in an 
enterprise setting. 

How This Document Is Organized 

The first section provides an overview of the SGEEE software computing 
environment. Then it explains the function of policies in the SGEEE software 
environment. Finally, the document gives examples of how these policies can be 
applied to channel compute resources efficiently in the enterprise environment. 

This document refers to Sun Grid Engine, Enterprise Edition only. Complete product 
documentation, as well as other product information, including information about 
Sun Grid Engine (Standard Edition) is available at 

http://www.sun.com/gridware 
What Is Grid Computing? 

Conceptually, a Grid is quite simple: it is a collection of computing resources that 
perform tasks. It appears to users as a large system, providing a single point of 
access to powerful distributed resources. Users treat the Grid as a single 
computational resource. Resource management software, such as Sun Grid Engine, 
Enterprise Edition, accepts jobs submitted by users and schedules them for 
execution on appropriate systems in the Grid based upon resource management 
policies. Users can literally submit thousands of jobs at a time without being 
concerned about where they run. 

No two grids are alike; one size does not fit all situations. There are three key classes 
of grids, which scale from single systems to supercomputer-class compute farms that 
utilize thousands of processors: 

■ Cluster Grids are the simplest, consisting of one or more systems working together 
to provide a single point of access to users in a single project or department. 
■ Campus Grids enable multiple projects or departments within an organization to 
share computing resources. Organizations can use campus grids to handle a wide 
variety of tasks, from cyclical business processes to rendering, data mining, and 
more. 
■ Global Grids are a collection of campus grids that cross organizational boundaries 
to create very large virtual systems. Users have access to compute power that far 
exceeds the resources available within their own organization. 

Sun Grid Engine, Enterprise Edition (SGEEE) v5.3 Beta software, the newest version 
of Sun's resource management software solution, provides the power and flexibility 
required for Campus Grids. SGEEE software orchestrates the delivery of 
computational power based upon enterprise resource policies set by the 
organization's technical and management staff. SGEEE software uses these policies 
to examine the available computational resources within the Campus Grid, gathers 
these resources, and then allocates and delivers them automatically in a way that 
optimizes usage across the Campus Grid. 

To enable cooperation within the Campus Grid, project owners using the Grid need 
to negotiate policies; have flexibility in the policies for manual overrides for unique 
project requirements; and have the policies automatically monitored and enforced. 
For example, SGEEE can allocate compute cycles to a project with an immediate 
deadline. One automobile manufacturer uses SGEEE for running car crash 
simulation projects. A government agency runs up to 24,000 simultaneous jobs to 
produce its monthly budget reports. 
Policy Overview 



Policies are determined by the particular needs of the organization at any moment; 
they are implemented by the compute farm system administrator. There are four 
policy systems in SGEEE software. 

■ The share based or share tree policy 

■ The functional policy 

■ The override policy 

■ The deadline policy 



Share Based Or Share Tree Policy 

The share based policy allocates a percentage of compute resources to all defined 
users of the compute farm: all the users share the resources as described in the cases 
above. In addition, the share policy adjusts for past usage of resources. If 
a user accumulates more resource usage than is due under the user's entitlement 
defined in the share tree, SGEEE software adjusts for this "over-usage" by lowering 
the entitlement of that user for a certain period of time until the user's resource 
usage meets the allocated entitlement. Conversely, the user's entitlement might need 
to be increased temporarily, to compensate for "under-utilization" of resources in the 
past. 



Override Policy 

The override policy is the straightforward assignment of tickets to a user (or a 
project or other categories) by the system administrator, altering the usage of the 
compute resources in the direction desired. Unlike deadline policy tickets, which are 
automatically withdrawn after the user's job executes, override policy tickets stay 
with the user until withdrawn manually by the system administrator. 



Functional Policy 

The functional policy is similar to the share tree policy, but it has no penalties or 
compensation based on past usage. It functions like a particular case of the share tree 
policy where the half-life is zero and the compensation factor is 1. 

For performance reasons, the functional policy is not implemented as a special case 
of the share based policy. Therefore, it does not present itself as a simplified share 
tree in the GUI. By activating both the share tree and the functional policies, a user 
could receive tickets from both policies, making the actual share distribution (not the 
entitlements) less easy to track. Because of that, it is wise not to mix both share and 
functional policies during an initial installation at a new site. This policy can be 
activated at a later time during fine tuning. 



Deadline Policy 

The deadline policy allocates a large number of tickets to a user at a specified time 
when a job needs to execute. The number of tickets increases linearly up to the 
maximum number of tickets assigned to the user. This policy will take away the 
extra tickets from the user after the job finishes. As with the functional policy, it is 
advisable to activate a deadline policy as a separate step after the basic installation is 
complete. A deadline policy is really a fine-tuning of the system by an experienced 
system administrator. When assigning tickets with a deadline policy, the system 
administrator must be experienced enough to know how many deadline tickets need 
to be allocated so the job starts in time to meet the deadline. 



Policy Systems in Sun Grid Engine, Enterprise Edition 5.3 Software • November 2001 



The Share Based Policy: An Example 

To set up a share based policy, we have to calculate the percentage of the total 
enterprise's compute power (the compute farm) that a user or a department or a 
project is entitled to receive. This calculation is the share tree. In the example that 
follows, we will set up a SGEEE software policy based on a share tree. 
(The figures used in these examples are arbitrarily decided; they are used only to 
show how a share tree policy works.) 

We calculated the amount of the share by starting from the root of the share tree 
shown in Figure 1. 
FIGURE 1 Levels at the root of the share tree 






To set up the share based policy, we need to determine the ratio of shares at each 
level of the share tree. In the example provided, there are two levels, the department 
level and the user level. 

Department #1: has 200 shares assigned 
Department #2: has 800 shares assigned 

At the departmental level, there are 1000 shares. Department #1 has a 20% share 
entitlement; Department #2 has an 80% share entitlement. 

At the next level in this example, the user level, user shares are assigned for each 
department. 



Department 1 

Department #1 has 2 users (or projects), User A and User B. 
User A: has 500 shares assigned 
User B: has 500 shares assigned 

Each user therefore has a 50% entitlement to the shares available to 
the department. 

In terms of the total number of shares in the grid cluster, User A has a 10% share 
(User A's 50% entitlement multiplied by the 20% entitlement of Department #1 at 
the departmental level). User B also has a 10% share of the total resources 
available to the cluster. 

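Department 2 

Department #2 also has 2 users, User C and User D. 
User C: has 750 shares assigned 
User D: has 250 shares assigned 

In terms of the total number of shares in the grid cluster, User C has a 60% share 
(75% of the 80% entitlement of Department #2) and User D has a 20% share 
(25% of the 80% entitlement of Department #2). 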
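
The entitlement arithmetic for the two-level share tree can be sketched as follows. This is an illustrative Python sketch, not SGEEE code; the dictionary layout and the function name are assumptions made for the example, while the share values come from the example above.

```python
# Shares from the example: departments at the first level, users at the second.
share_tree = {
    "Department 1": {"shares": 200, "users": {"A": 500, "B": 500}},
    "Department 2": {"shares": 800, "users": {"C": 750, "D": 250}},
}

def user_entitlements(tree):
    """Return each user's fraction of the whole cluster.

    A user's entitlement is the department's fraction of all department
    shares multiplied by the user's fraction of the department's user shares.
    """
    total_dept = sum(dept["shares"] for dept in tree.values())
    result = {}
    for dept in tree.values():
        dept_fraction = dept["shares"] / total_dept
        total_user = sum(dept["users"].values())
        for user, shares in dept["users"].items():
            result[user] = dept_fraction * shares / total_user
    return result

entitlements = user_entitlements(share_tree)
# A and B: 0.2 * 0.5 each; C: 0.8 * 0.75; D: 0.8 * 0.25
```

Only the ratios between shares at each level matter, so assigning 2 and 8 shares to the departments would produce the same entitlements as 200 and 800.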
This first share distribution used is shown at the top of Figure 2, Entering Tickets. 



[Figure 2 shows Policy A with 1,000 tickets, the department entitlements, and the 
user and project level entitlements for users A, B, C, and D, followed by two 
ticket distributions.] 

Case 1: Only B in the system. Ticket distribution: A 0, B 1,000, C 0, D 0. 
If only B uses the system at a given moment, B has 100% of the resources, in 
spite of only a 10% entitlement. B gets all 1,000 tickets. 

Case 2: Only B and D are in the system. Ticket distribution: A 0, B 333, C 0, D 667. 
If only B (10% entitlement) and D (20% entitlement) use the system at a given 
moment, they share the 1,000 tickets only among themselves. B has 33.3% and D 
66.7%. B gets 333 tickets and D gets 667 tickets. 

FIGURE 2 Entering Tickets 

In the grid cluster, work, in the form of a compute job, is called a ticket. Once work 
is requested by a user (or users) within the grid cluster, computational resources are 
assigned according to the share entitlement policy. 

If 1000 tickets are assigned to this grid cluster by the system administrator, the 
tickets will be distributed to the grid cluster in a manner consistent with the 
entitlements of the share tree, the share policy set for this grid cluster. 
FIGURE 3 Assigning tickets to a policy 



The tickets are distributed to all users or projects active at any given time according 
to the share policy. The following three cases illustrate the allocation of compute 
resources according to this share tree entitlement. 



Case 1: Only User B Is Active. 

In case 1, User B is the only active user on the grid cluster. User B has an 
entitlement of 10%, but because there are no other users submitting tickets, User B 
has 100% of the total 1000 active tickets. In case 1, User B receives all 1000 tickets 
for compute resources. See "Entering Tickets" on page 7. 



Case 2: Users B And D Are Active. 

In case 2, both User B and User D are active. User B has 33.3% actual usage, still 
much higher than the assigned entitlement of 10%. User D has 66.7% actual usage, 
much higher than the assigned entitlement of 20%. In this case, the 1,000 tickets are 
distributed as follows: User B gets 333 tickets and User D gets 667 tickets. Note 
that Sun Grid Engine, Enterprise Edition does not allow fractional tickets. Again, see 
"Entering Tickets" on page 7. 
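
The ticket distribution in cases 1 and 2 can be reproduced with a short Python sketch. It is an illustration of the proportional split described in the text, not the SGEEE scheduler; the function name is hypothetical, and rounding to the nearest whole ticket is an assumption of the sketch. The 60% value for User C is computed from the shares given earlier (750 of 1000 user shares within the 80% department entitlement).

```python
def distribute_tickets(entitlements, active, total_tickets):
    """Split tickets among active users in proportion to their entitlements.

    Rounding to the nearest whole ticket is an assumption of this sketch;
    for other inputs the rounded values may not add up to total_tickets.
    """
    weight = sum(entitlements[user] for user in active)
    return {
        user: round(total_tickets * entitlements[user] / weight)
        for user in active
    }

# Entitlements in percent, from the share tree example.
entitlements = {"A": 10, "B": 10, "C": 60, "D": 20}

case1 = distribute_tickets(entitlements, ["B"], 1000)       # only B active
case2 = distribute_tickets(entitlements, ["B", "D"], 1000)  # B and D active
```

Inactive users do not take part in the split, which is why a user with a 10% entitlement can hold all 1,000 tickets when nobody else is active.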
Case 3: Adding Override Policy Tickets 

For case 3, assume that only Users B and D are active but that User B needs 
additional compute resources. The system administrator issues 700 override policy 
tickets to User B for a week. 

During this time, User B has 333 share policy tickets plus 700 override policy 
tickets, for a total of 1,033 tickets, and User D has 667 share policy tickets. 

As a percentage of the total of 1,700 tickets, User B has 61% (1,033/1,700) of the 
available resources (up from 33.3%) and User D has 39% (667/1,700) 
of the available resources (down from 66.7%). 
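
The effect of the override tickets on the resource split can be checked with a few lines of Python. This is a sketch of the arithmetic only; the variable names are hypothetical, and the ticket counts are the ones used in this case.

```python
share_tickets = {"B": 333, "D": 667}   # share policy tickets from case 2
override_tickets = {"B": 700}          # issued by the system administrator

# Each user's total is the sum of the tickets received from both policies.
totals = {
    user: share_tickets[user] + override_tickets.get(user, 0)
    for user in share_tickets
}
pool = sum(totals.values())            # 1,700 tickets in total
fractions = {user: totals[user] / pool for user in totals}
```

Because the override tickets enlarge the pool rather than being taken from User D, User D keeps all 667 tickets but holds a smaller fraction of the total.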




FIGURE 4 The Override Policy with 700 tickets assigned to User B 




The Half-life Factor 

The share tree policy takes past resource consumption into account, be it six 
months old or more recent, when it adjusts the entitlements defined in the 
share tree. Sun Grid Engine, Enterprise Edition maintains a record of users' resource 
consumption. 

Resource usage in this record covers the computer 
resources consumed over a "sliding window" of time. The length of this window is 
determined by a "half-life" factor, which in Sun Grid Engine, Enterprise Edition is an 
internal decay function. This decay function reduces the impact of accumulated 
resource consumption over time. 

For example, with a half-life of 7 days, a consumption of 1,000 units results 
in the following usage "penalty": 

- 500 after 7 days 
- 250 after 14 days 
- 125 after 21 days 
- 62.5 after 28 days 

After several half-lives, the remaining penalty becomes negligible. The exact 
description of the decay function is given in Appendix 1. Override tickets that a 
user receives are not part of the penalty, as they 
belong to a different policy system. The decay function is a characteristic of the share 
tree policy only. 
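
The penalty values in the list above follow from repeated halving. The Python sketch below reproduces them under the assumption, inferred from those values, of an initial usage of 1,000 units and a half-life of 7 days; the function name is hypothetical.

```python
def decayed_usage(initial_usage, days, half_life_days):
    """Usage still counted after `days`, halving every `half_life_days`."""
    return initial_usage * 0.5 ** (days / half_life_days)

# Remaining usage at the four points in time listed in the text.
table = {days: decayed_usage(1000, days, 7) for days in (7, 14, 21, 28)}
```

A longer half-life makes past consumption count for longer, while a shorter one lets the system forget it more quickly.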

The Compensation Factor 

The share tree policy also has a "compensation factor." This factor limits the 
short term entitlement adjustment for a 
user who was previously inactive. The system compensates the user 
automatically by an amount of up to the compensation factor times 
the long term entitlement defined by 
the share tree policy. 

For example, if the entitlement defined by the share tree policy for User B is 10% 
and the compensation factor assigned to this user is 5, then the system can raise User 
B's short term entitlement up to 50%. 



FIGURE 5 Typical half-life and compensation factor set up 



Case 4: Half-life And Compensation Factors 

This case is a modification of Case 2. The original share entitlements for Case 2 were 
as follows: 

User A = 10% 

User B = 10% 

User D = 20% 

User C is inactive 

The total resource entitlement for the three users A, B, and D is 40%. In this case, the 
1,000 tickets are distributed as follows: 

User A = 1,000*10%*(100/40) = 250 tickets = 25% (current) 

User B = 1,000*10%*(100/40) = 250 tickets = 25% (current) 

User D = 1,000*20%*(100/40) = 500 tickets = 50% (current) 

Now assume User A starts using the system after User B and User D have already 
utilized the cluster for some time. 

In this case, User B and User D have accumulated more resources than they were 
entitled to because Users A and C have been inactive for some time. For example, User B 
might have consumed 33% of the resources and User D 66% of the resources. 
Meanwhile, User A consumed less than 1%. Sun Grid Engine, Enterprise 
Edition could dynamically adjust the entitlements for a short time so that User A 
gets 70% of the resources, while User B receives 10% and User D receives 20%. 



However, if a compensation factor of 5 is assigned to User A, the most resource 
usage User A could be allocated would be 5 times the 10% share policy 
entitlement, a maximum of 50% resource allocation. 

In this case, User B might get 17% of the available resources and User D would 
receive 33%. 

User A with 50%, up from 25% 

User B with 17%, down from 25% 

User D with 33%, down from 50%. 

Note that this short term entitlement adjustment will change dynamically as User A 
accumulates resource usage according to the long term entitlement defined by the 
share tree policy. 
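
The capped allocation in Case 4 can be sketched in Python as follows. This is an illustration only, not the SGEEE algorithm: it assumes the returning user receives exactly the capped share and that the remainder is split among the other active users in proportion to their entitlements, which reproduces the 50%, 17%, and 33% figures above. The function name is hypothetical.

```python
def short_term_shares(entitlements, returning_user, compensation_factor):
    """Cap the returning user's share and split the rest proportionally.

    Assumption of this sketch: the returning user gets the full cap
    (compensation factor times long term entitlement, at most 100%).
    """
    cap = min(1.0, compensation_factor * entitlements[returning_user])
    others = {u: e for u, e in entitlements.items() if u != returning_user}
    others_total = sum(others.values())
    shares = {returning_user: cap}
    for user, entitlement in others.items():
        shares[user] = (1.0 - cap) * entitlement / others_total
    return shares

# Long term entitlements of the three active users in Case 4.
active = {"A": 0.10, "B": 0.10, "D": 0.20}
shares = short_term_shares(active, "A", 5)
```

Without the cap, the short term adjustment could hand User A 70% of the resources; the compensation factor bounds how far the system may overshoot the long term entitlement.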
Defining Compute Resources In Grid, Enterprise 
Edition 

As part of the share tree policy, Sun Grid Engine, Enterprise Edition software allows 
the system administrator to assign factors of relative importance among three types 
of compute resources: CPU cycles, I/O activity, and memory usage. These three 
resources are defined as follows: 

■ CPU cycles, as measured by the operating system 

■ Memory, measured as a mathematical integral over time 
(2 GB over 2 days equals 4GB for 1 day) 

■ I/O as reported by the operating system 

For example, if only CPU cycles are considered as a compute resource, the system 
administrator will assign a value of 100% to the CPU and zero to the other two. 
Using three sliders in the GUI, adjustments in resource allocation can be made. 
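
The two ideas in this section, memory measured as an integral over time and relative weights for the three resources, can be illustrated with a small Python sketch. The weighted sum used to combine the resources is an assumption of this sketch, as are the function names and sample values; unit normalization between the three resources is ignored.

```python
def memory_integral(segments):
    """Integrate memory use over time from (gigabytes, days) segments."""
    return sum(gigabytes * days for gigabytes, days in segments)

def combined_usage(cpu, memory, io, weights):
    """Weighted sum of the three usage figures.

    The weights play the role of the three sliders and are assumed to
    add up to 1.0.
    """
    return (weights["cpu"] * cpu
            + weights["mem"] * memory
            + weights["io"] * io)

# 2 GB held for 2 days counts the same as 4 GB held for 1 day.
two_days = memory_integral([(2, 2)])
one_day = memory_integral([(4, 1)])

# Only CPU cycles are considered: 100% for CPU, zero for the other two.
cpu_only = combined_usage(100.0, two_days, 50.0,
                          {"cpu": 1.0, "mem": 0.0, "io": 0.0})
```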



Share Tree Policy Parameters 
FIGURE 6 Usage definition in the share tree GUI 

Sun Grid Engine, Enterprise Edition software can account for accumulated 
consumption according to these definitions and can compare it with the entitlement 
of users and projects in the share tree policy to calculate consumption penalties. 






Priorities in SGEEE Software 



[...] and 100 jobs submitted. Each job has 90 
Appendix 1: Mathematical Definition Of the 
Decay Factor 

The actual formula used to define SGEEE software's decay factor is derived from 
nuclear physics. 

Radioactive decay is described by the following formula: 

$N(t) = N(0) \cdot \exp(-D \cdot t)$ 

N(t) is the number of particles at time t, N(0) is the initial number, D is the decay 
rate, a material constant, and t is time. If you know the half-life $t_{1/2}$, then with this 
formula you can compute D: 

$D = -\log(1/2) / t_{1/2}$ 

With this D, for any fixed time interval you can compute a constant factor 
(evaluating the e-function using the constant D and t values) that converts the number of 
particles at the beginning of the interval into the resulting (not decayed) number of 
particles at the end of the interval. If we call this factor R, we have: 

$N(\mathrm{end}) = N(\mathrm{begin}) \cdot R$ (Note: $0 < R < 1$) 

You can use this iteratively with N(end) equal to N(begin) in the succeeding interval. 

This is exactly what is done in SGEEE software. With the half-life time factor 
specified by the administrator, R is computed for the GRD scheduler interval. Then, 
during every scheduler interval, all the accumulated resource consumption for each 
user and project is decayed by the simple multiplication above. 
Of course, at each scheduler interval, additional "new" usage may be added to the 
accumulated usage of any user/project. This is not a problem, though, as it is fresh, 
so to speak, and hence not decayed. This fresh usage will be decayed for the first 
time in the next scheduler interval together with the other accumulated usage by 
applying the multiplication formula above to the current sum of usage of each user 
or project. The "older" a usage contribution is (the longer ago it has been added), the 
more often it will have been multiplied (decayed) and the smaller its impact will be 
in determining any consumption penalty. 
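
The per-interval decay described in this appendix can be sketched in Python. This is an illustration of the formulas above under stated assumptions, not SGEEE source code: the function names are hypothetical, and the half-life and the scheduler interval are assumed to be expressed in the same time unit.

```python
import math

def decay_factor(half_life, interval):
    """Factor R for one scheduler interval, from N(t) = N(0) * exp(-D * t)."""
    decay_rate = -math.log(0.5) / half_life   # D computed from the half-life
    return math.exp(-decay_rate * interval)

def accumulate(fresh_usage_per_interval, half_life, interval):
    """Decay the accumulated usage each interval, then add the fresh usage.

    Fresh usage is added undecayed and is decayed for the first time in
    the following interval, as described in the text.
    """
    r = decay_factor(half_life, interval)
    accumulated = 0.0
    for fresh in fresh_usage_per_interval:
        accumulated = accumulated * r + fresh
    return accumulated

r_week = decay_factor(7.0, 7.0)                    # interval equals half-life
remaining = accumulate([1000.0, 0.0, 0.0], 7.0, 7.0)
```

Because R is a constant for a fixed interval, each scheduler run needs only one multiplication per user or project, regardless of how much history the accumulated usage represents.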